Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - Convert content from the ext4 wiki to Documentation rst files so it
   is more likely to be updated as we add new features to ext4.

 - Add 64-bit timestamp support to ext4's superblock fields.

 - ... and the usual bug fixes and cleanups, including a Spectre gadget
   fixup and some hardening against maliciously corrupted file systems.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (34 commits)
  ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
  ext4: improve code readability in ext4_iget()
  ext4: fix spectre gadget in ext4_mb_regular_allocator()
  ext4: check for NUL characters in extended attribute's name
  ext4: use ext4_warning() for sb_getblk failure
  ext4: fix race when setting the bitmap corrupted flag
  ext4: reset error code in ext4_find_entry in fallback
  ext4: handle layout changes to pinned DAX mappings
  dax: dax_layout_busy_page() warn on !exceptional
  docs: fix up the obviously obsolete bits in the new ext4 documentation
  docs: add new ext4 superblock time extension fields
  docs: create filesystem internal section
  ext4: use swap macro in mext_page_double_lock
  ext4: check allocation failure when duplicating "data" in ext4_remount()
  ext4: fix warning message in ext4_enable_quotas()
  ext4: super: extend timestamps to 40 bits
  jbd2: replace current_kernel_time64 with ktime equivalent
  ext4: use timespec64 for all inode times
  ext4: use ktime_get_real_seconds for i_dtime
  ext4: use 64-bit timestamps for mmp_time
  ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2018-08-13 22:34:47 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2018-08-13 22:34:47 -0700
commit: 10f3e23f07cb0c20f9bcb77a5b5a7eb2a1b2a2fe (patch)
tree: 1fcb34309b3542512c6f3345f092f7adb8c3312c /Documentation
parent: 3bb37da509e576c80180fa0e4d1cfcaddf0cb82e (diff)
parent: 863c37fcb14f8b66ea831b45fb35a53ac4a8d69e (diff)
27 files changed, 3840 insertions, 79 deletions
diff --git a/Documentation/conf.py b/Documentation/conf.py
index 62ac5a9f3a9f..b691af4831fa 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -34,7 +34,7 @@ needs_sphinx = '1.3'
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure']
+extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig']
 
 # The name of the math extension changed on Sphinx 1.4
 if major == 1 and minor > 3:
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4/ext4.rst
index 7f628b9f7c4b..9d4368d591fa 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4/ext4.rst
@@ -1,6 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
 
-Ext4 Filesystem
-===============
+========================
+General Information
+========================
 
 Ext4 is an advanced level of the ext3 filesystem which incorporates
 scalability and reliability enhancements for supporting large filesystems
@@ -11,37 +13,30 @@ Mailing list:	linux-ext4@vger.kernel.org
 Web site:	http://ext4.wiki.kernel.org
 
 
-1. Quick usage instructions:
-===========================
+Quick usage instructions
+========================
 
 Note: More extensive information for getting started with ext4 can be
-      found at the ext4 wiki site at the URL:
-      http://ext4.wiki.kernel.org/index.php/Ext4_Howto
+found at the ext4 wiki site at the URL:
+http://ext4.wiki.kernel.org/index.php/Ext4_Howto
 
-  - Compile and install the latest version of e2fsprogs (as of this
-    writing version 1.41.3) from:
+  - The latest version of e2fsprogs can be found at:
+
+    https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
 
-    http://sourceforge.net/project/showfiles.php?group_id=2406
-	
 	or
 
-    https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+    http://sourceforge.net/project/showfiles.php?group_id=2406
 
 	or grab the latest git repository from:
 
-    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
-
-  - Note that it is highly important to install the mke2fs.conf file
-    that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
-    you have edited the /etc/mke2fs.conf file installed on your system,
-    you will need to merge your changes with the version from e2fsprogs
-    1.41.x.
+   https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
 
   - Create a new filesystem using the ext4 filesystem type:
 
-    	# mke2fs -t ext4 /dev/hda1
+        # mke2fs -t ext4 /dev/hda1
 
-    Or to configure an existing ext3 filesystem to support extents: 
+    Or to configure an existing ext3 filesystem to support extents:
 
 	# tune2fs -O extents /dev/hda1
 
@@ -50,10 +45,6 @@ Note: More extensive information for getting started with ext4 can be
 
         # tune2fs -I 256 /dev/hda1
 
-    (Note: we currently do not have tools to convert an ext4
-    filesystem back to ext3; so please do not do try this on production
-    filesystems.)
-
   - Mounting:
 
 	# mount -t ext4 /dev/hda1 /wherever
@@ -75,10 +66,11 @@ Note: More extensive information for getting started with ext4 can be
     the filesystem with a large journal can also be helpful for
     metadata-intensive workloads.
 
-2. Features
-===========
+Features
+========
 
-2.1 Currently available
+Currently Available
+-------------------
 
 * ability to use filesystems > 16TB (e2fsprogs support not available yet)
 * extent format reduces metadata overhead (RAM, IO for access, transactions)
@@ -103,31 +95,15 @@ Note: More extensive information for getting started with ext4 can be
 [1] Filesystems with a block size of 1k may see a limit imposed by the
 directory hash tree having a maximum depth of two.
 
-2.2 Candidate features for future inclusion
-
-* online defrag (patches available but not well tested)
-* reduced mke2fs time via lazy itable initialization in conjunction with
-  the uninit_bg feature (capability to do this is available in e2fsprogs
-  but a kernel thread to do lazy zeroing of unused inode table blocks
-  after filesystem is first mounted is required for safety)
-
-There are several others under discussion, whether they all make it in is
-partly a function of how much time everyone has to work on them. Features like
-metadata checksumming have been discussed and planned for a bit but no patches
-exist yet so I'm not sure they're in the near-term roadmap.
-
-The big performance win will come with mballoc, delalloc and flex_bg
-grouping of bitmaps and inode tables.  Some test results available here:
-
- - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
- - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
-
-3. Options
-==========
+Options
+=======
 
 When mounting an ext4 filesystem, the following option are accepted:
 (*) == default
 
+======================= =======================================================
+Mount Option            Description
+======================= =======================================================
 ro                   	Mount filesystem read only. Note that ext4 will
                      	replay the journal (and thus write to the
                      	partition) even when mounted "read only". The
@@ -387,33 +363,38 @@ i_version		Enable 64-bit inode version support. This option is
 dax			Use direct access (no page cache).  See
 			Documentation/filesystems/dax.txt.  Note that
 			this option is incompatible with data=journal.
+======================= =======================================================
 
 Data Mode
 =========
 There are 3 different data modes:
 
 * writeback mode
-In data=writeback mode, ext4 does not journal data at all.  This mode provides
-a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
-mode - metadata journaling.  A crash+recovery can cause incorrect data to
-appear in files which were written shortly before the crash.  This mode will
-typically provide the best ext4 performance.
+
+  In data=writeback mode, ext4 does not journal data at all.  This mode provides
+  a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+  mode - metadata journaling.  A crash+recovery can cause incorrect data to
+  appear in files which were written shortly before the crash.  This mode will
+  typically provide the best ext4 performance.
 
 * ordered mode
-In data=ordered mode, ext4 only officially journals metadata, but it logically
-groups metadata information related to data changes with the data blocks into a
-single unit called a transaction.  When it's time to write the new metadata
-out to disk, the associated data blocks are written first.  In general,
-this mode performs slightly slower than writeback but significantly faster than journal mode.
+
+  In data=ordered mode, ext4 only officially journals metadata, but it logically
+  groups metadata information related to data changes with the data blocks into
+  a single unit called a transaction.  When it's time to write the new metadata
+  out to disk, the associated data blocks are written first.  In general, this
+  mode performs slightly slower than writeback but significantly faster than
+  journal mode.
 
 * journal mode
-data=journal mode provides full data and metadata journaling.  All new data is
-written to the journal first, and then to its final location.
-In the event of a crash, the journal can be replayed, bringing both data and
-metadata into a consistent state.  This mode is the slowest except when data
-needs to be read from and written to disk at the same time where it
-outperforms all others modes.  Enabling this mode will disable delayed
-allocation and O_DIRECT support.
+
+  data=journal mode provides full data and metadata journaling.  All new data is
+  written to the journal first, and then to its final location.  In the event of
+  a crash, the journal can be replayed, bringing both data and metadata into a
+  consistent state.  This mode is the slowest except when data needs to be read
+  from and written to disk at the same time where it outperforms all other
+  modes.  Enabling this mode will disable delayed allocation and O_DIRECT
+  support.
 
 /proc entries
 =============
@@ -425,10 +406,12 @@ Information about mounted ext4 file systems can be found in
 in table below.
 
 Files in /proc/fs/ext4/<devname>
-..............................................................................
+
+================ =======
  File            Content
+================ =======
  mb_groups       details of multiblock allocator buddy cache of free blocks
-..............................................................................
+================ =======
 
 /sys entries
 ============
@@ -439,28 +422,30 @@ Information about mounted ext4 file systems can be found in
 /sys/fs/ext4/dm-0).   The files in each per-device directory are shown
 in table below.
 
-Files in /sys/fs/ext4/<devname>
+Files in /sys/fs/ext4/<devname>:
+
 (see also Documentation/ABI/testing/sysfs-fs-ext4)
-..............................................................................
- File                         Content
 
+============================= =================================================
+File                          Content
+============================= =================================================
  delayed_allocation_blocks    This file is read-only and shows the number of
                               blocks that are dirty in the page cache, but
                               which do not have their location in the
                               filesystem allocated yet.
 
- inode_goal                   Tuning parameter which (if non-zero) controls
+inode_goal                    Tuning parameter which (if non-zero) controls
                               the goal inode used by the inode allocator in
                               preference to all other allocation heuristics.
                               This is intended for debugging use only, and
                               should be 0 on production systems.
 
- inode_readahead_blks         Tuning parameter which controls the maximum
+inode_readahead_blks          Tuning parameter which controls the maximum
                               number of inode table blocks that ext4's inode
                               table readahead algorithm will pre-read into
                               the buffer cache
 
- lifetime_write_kbytes        This file is read-only and shows the number of
+lifetime_write_kbytes         This file is read-only and shows the number of
                               kilobytes of data that have been written to this
                               filesystem since it was created.
 
@@ -508,7 +493,7 @@ Files in /sys/fs/ext4/<devname>
                               in the file system. If there is not enough space
                               for the reserved space when mounting the file
                               mount will _not_ fail.
-..............................................................................
+============================= =================================================
 
 Ioctls
 ======
@@ -518,8 +503,10 @@ through the system call interfaces. The list of all Ext4 specific ioctls are
 shown in the table below.
 
 Table of Ext4 specific ioctls
-..............................................................................
- Ioctl			      Description
+
+============================= =================================================
+Ioctl			      Description
+============================= =================================================
  EXT4_IOC_GETFLAGS	      Get additional attributes associated with inode.
 			      The ioctl argument is an integer bitfield, with
 			      bit values described in ext4.h. This ioctl is an
@@ -610,8 +597,7 @@ Table of Ext4 specific ioctls
 			      normal user by accident.
 			      The data blocks of the previous boot loader
 			      will be associated with the given inode.
-
-..............................................................................
+============================= =================================================
 
 References
 ==========
diff --git a/Documentation/filesystems/ext4/index.rst b/Documentation/filesystems/ext4/index.rst
new file mode 100644
index 000000000000..71121605558c
--- /dev/null
+++ b/Documentation/filesystems/ext4/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+ext4 Filesystem
+===============
+
+General usage and on-disk artifacts written by ext4.  More documentation may
+be ported from the wiki as time permits.  This should be considered the
+canonical source of information as the details here have been reviewed by
+the ext4 community.
+
+.. toctree::
+   :maxdepth: 5
+   :numbered:
+
+   ext4
+   ondisk/index
diff --git a/Documentation/filesystems/ext4/ondisk/about.rst b/Documentation/filesystems/ext4/ondisk/about.rst
new file mode 100644
index 000000000000..0aadba052264
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/about.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+About this Book
+===============
+
+This document attempts to describe the on-disk format for ext4
+filesystems. The same general ideas should apply to ext2/3 filesystems
+as well, though they do not support all the features that ext4 supports,
+and the fields will be shorter.
+
+**NOTE**: This is a work in progress, based on notes that the author
+(djwong) made while picking apart a filesystem by hand. The data
+structure definitions should be current as of Linux 4.18 and
+e2fsprogs-1.44. All comments and corrections are welcome, since there is
+undoubtedly plenty of lore that might not be reflected in freshly
+created demonstration filesystems.
+
+License
+-------
+This book is licensed under the terms of the GNU General Public License, v2.
+
+Terminology
+-----------
+
+ext4 divides a storage device into an array of logical blocks both to
+reduce bookkeeping overhead and to increase throughput by forcing larger
+transfer sizes. Generally, the block size will be 4KiB (the same size as
+pages on x86 and the block layer's default block size), though the
+actual size is calculated as 2 ^ (10 + ``sb.s_log_block_size``) bytes.
+Throughout this document, disk locations are given in terms of these
+logical blocks, not raw LBAs, and not 1024-byte blocks. For the sake of
+convenience, the logical block size will be referred to as
+``$block_size`` throughout the rest of the document.
+
+When referenced in ``preformatted text`` blocks, ``sb`` refers to fields
+in the super block, and ``inode`` refers to fields in an inode table
+entry.
+
+Other References
+----------------
+
+Also see http://www.nongnu.org/ext2-doc/ for quite a collection of
+information about ext2/3. Here's another old reference:
+http://wiki.osdev.org/Ext2
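
Note on the Terminology section of about.rst above: the block-size formula, 2 ^ (10 + ``sb.s_log_block_size``), can be checked with a short sketch. The function name `block_size_bytes` is illustrative only and does not exist in the kernel or e2fsprogs; the input is assumed to be the raw superblock field value.

```python
def block_size_bytes(s_log_block_size: int) -> int:
    """Return the logical block size in bytes: 2 ^ (10 + s_log_block_size)."""
    if s_log_block_size < 0:
        raise ValueError("s_log_block_size must be non-negative")
    # Shifting 1 left by (10 + n) is the same as raising 2 to that power.
    return 1 << (10 + s_log_block_size)


if __name__ == "__main__":
    # s_log_block_size = 2 corresponds to the usual 4KiB block.
    for log in (0, 1, 2):
        print(log, block_size_bytes(log))
```

A field value of 0 gives the 1024-byte minimum, which is why the text stresses that disk locations are given in logical blocks and not in 1024-byte units.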
diff --git a/Documentation/filesystems/ext4/ondisk/allocators.rst b/Documentation/filesystems/ext4/ondisk/allocators.rst
new file mode 100644
index 000000000000..7aa85152ace3
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/allocators.rst
@@ -0,0 +1,56 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Block and Inode Allocation Policy
+---------------------------------
+
+ext4 recognizes (better than ext3, anyway) that data locality is
+generally a desirable quality of a filesystem. On a spinning disk,
+keeping related blocks near each other reduces the amount of movement
+that the head actuator and disk must perform to access a data block,
+thus speeding up disk IO. On an SSD there of course are no moving parts,
+but locality can increase the size of each transfer request while
+reducing the total number of requests. This locality may also have the
+effect of concentrating writes on a single erase block, which can speed
+up file rewrites significantly. Therefore, it is useful to reduce
+fragmentation whenever possible.
+
+The first tool that ext4 uses to combat fragmentation is the multi-block
+allocator. When a file is first created, the block allocator
+speculatively allocates 8KiB of disk space to the file on the assumption
+that the space will get written soon. When the file is closed, the
+unused speculative allocations are of course freed, but if the
+speculation is correct (typically the case for full writes of small
+files) then the file data gets written out in a single multi-block
+extent. A second related trick that ext4 uses is delayed allocation.
+Under this scheme, when a file needs more blocks to absorb file writes,
+the filesystem defers deciding the exact placement on the disk until all
+the dirty buffers are being written out to disk. By not committing to a
+particular placement until it's absolutely necessary (the commit timeout
+is hit, or sync() is called, or the kernel runs out of memory), the hope
+is that the filesystem can make better location decisions.
+
+The third trick that ext4 (and ext3) uses is that it tries to keep a
+file's data blocks in the same block group as its inode. This cuts down
+on the seek penalty when the filesystem first has to read a file's inode
+to learn where the file's data blocks live and then seek over to the
+file's data blocks to begin I/O operations.
+
+The fourth trick is that all the inodes in a directory are placed in the
+same block group as the directory, when feasible. The working assumption
+here is that all the files in a directory might be related, therefore it
+is useful to try to keep them all together.
+
+The fifth trick is that the disk volume is cut up into 128MB block
+groups; these mini-containers are used as outlined above to try to
+maintain data locality. However, there is a deliberate quirk -- when a
+directory is created in the root directory, the inode allocator scans
+the block groups and puts that directory into the least heavily loaded
+block group that it can find. This encourages directories to spread out
+over a disk; as the top-level directory/file blobs fill up one block
+group, the allocators simply move on to the next block group. Allegedly
+this scheme evens out the loading on the block groups, though the author
+suspects that the directories which are so unlucky as to land towards
+the end of a spinning drive get a raw deal performance-wise.
+
+Of course if all of these mechanisms fail, one can always use e4defrag
+to defragment files.
diff --git a/Documentation/filesystems/ext4/ondisk/attributes.rst b/Documentation/filesystems/ext4/ondisk/attributes.rst
new file mode 100644
index 000000000000..0b01b67b81fe
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/attributes.rst
@@ -0,0 +1,191 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Extended Attributes
+-------------------
+
+Extended attributes (xattrs) are typically stored in a separate data
+block on the disk and referenced from inodes via ``inode.i_file_acl*``.
+The first use of extended attributes seems to have been for storing file
+ACLs and other security data (selinux). With the ``user_xattr`` mount
+option it is possible for users to store extended attributes so long as
+all attribute names begin with “user”; this restriction seems to have
+disappeared as of Linux 3.0.
+
+There are two places where extended attributes can be found. The first
+place is between the end of each inode entry and the beginning of the
+next inode entry. For example, if inode.i\_extra\_isize = 28 and
+sb.inode\_size = 256, then there are 256 - (128 + 28) = 100 bytes
+available for in-inode extended attribute storage. The second place
+where extended attributes can be found is in the block pointed to by
+``inode.i_file_acl``. As of Linux 3.11, it is not possible for this
+block to contain a pointer to a second extended attribute block (or even
+the remaining blocks of a cluster). In theory it is possible for each
+attribute's value to be stored in a separate data block, though as of
+Linux 3.11 the code does not permit this.
+
+Keys are generally assumed to be ASCIIZ strings, whereas values can be
+strings or binary data.
+
+Extended attributes, when stored after the inode, have a header
+``ext4_xattr_ibody_header`` that is 4 bytes long:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_le32
+     - h\_magic
+     - Magic number for identification, 0xEA020000. This value is set by the
+       Linux driver, though e2fsprogs doesn't seem to check it(?)
+
+The beginning of an extended attribute block is in
+``struct ext4_xattr_header``, which is 32 bytes long:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_le32
+     - h\_magic
+     - Magic number for identification, 0xEA020000.
+   * - 0x4
+     - \_\_le32
+     - h\_refcount
+     - Reference count.
+   * - 0x8
+     - \_\_le32
+     - h\_blocks
+     - Number of disk blocks used.
+   * - 0xC
+     - \_\_le32
+     - h\_hash
+     - Hash value of all attributes.
+   * - 0x10
+     - \_\_le32
+     - h\_checksum
+     - Checksum of the extended attribute block.
+   * - 0x14
+     - \_\_u32
+     - h\_reserved[2]
+     - Zero.
+
+The checksum is calculated against the FS UUID, the 64-bit block number
+of the extended attribute block, and the entire block (header +
+entries).
+
+Following the ``struct ext4_xattr_header`` or
+``struct ext4_xattr_ibody_header`` is an array of
+``struct ext4_xattr_entry``; each of these entries is at least 16 bytes
+long. When stored in an external block, the ``struct ext4_xattr_entry``
+entries must be stored in sorted order. The sort order is
+``e_name_index``, then ``e_name_len``, and finally ``e_name``.
+Attributes stored inside an inode do not need to be stored in sorted order.
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_u8
+     - e\_name\_len
+     - Length of name.
+   * - 0x1
+     - \_\_u8
+     - e\_name\_index
+     - Attribute name index. There is a discussion of this below.
+   * - 0x2
+     - \_\_le16
+     - e\_value\_offs
+     - Location of this attribute's value on the disk block where it is stored.
+       Multiple attributes can share the same value. For an inode attribute
+       this value is relative to the start of the first entry; for a block this
+       value is relative to the start of the block (i.e. the header).
+   * - 0x4
+     - \_\_le32
+     - e\_value\_inum
+     - The inode where the value is stored. Zero indicates the value is in the
+       same block as this entry. This field is only used if the
+       INCOMPAT\_EA\_INODE feature is enabled.
+   * - 0x8
+     - \_\_le32
+     - e\_value\_size
+     - Length of attribute value.
+   * - 0xC
+     - \_\_le32
+     - e\_hash
+     - Hash value of attribute name and attribute value. The kernel doesn't
+       update the hash for in-inode attributes, so for that case this value
+       must be zero, because e2fsck validates any non-zero hash regardless of
+       where the xattr lives.
+   * - 0x10
+     - char
+     - e\_name[e\_name\_len]
+     - Attribute name. Does not include trailing NULL.
+
+Attribute values can follow the end of the entry table. There appears to
+be a requirement that they be aligned to 4-byte boundaries. The values
+are stored starting at the end of the block and grow towards the
+xattr\_header/xattr\_entry table. When the two collide, the overflow is
+put into a separate disk block. If the disk block fills up, the
+filesystem returns -ENOSPC.
+
+The first four fields of the ``ext4_xattr_entry`` are set to zero to
+mark the end of the key list.
+
+Attribute Name Indices
+~~~~~~~~~~~~~~~~~~~~~~
+
+Logically speaking, extended attributes are a series of key=value pairs.
+The keys are assumed to be NULL-terminated strings. To reduce the amount
+of on-disk space that the keys consume, the beginning of the key string
+is matched against the attribute name index. If a match is found, the
+attribute name index field is set, and matching string is removed from
+the key name. Here is a map of name index values to key prefixes:
+
+.. list-table::
+   :widths: 1 79
+   :header-rows: 1
+
+   * - Name Index
+     - Key Prefix
+   * - 0
+     - (no prefix)
+   * - 1
+     - “user.”
+   * - 2
+     - “system.posix\_acl\_access”
+   * - 3
+     - “system.posix\_acl\_default”
+   * - 4
+     - “trusted.”
+   * - 6
+     - “security.”
+   * - 7
+     - “system.” (inline\_data only?)
+   * - 8
+     - “system.richacl” (SuSE kernels only?)
+
+For example, if the attribute key is “user.fubar”, the attribute name
+index is set to 1 and the “fubar” name is recorded on disk.
+
+POSIX ACLs
+~~~~~~~~~~
+
+POSIX ACLs are stored in a reduced version of the Linux kernel (and
+libacl's) internal ACL format. The key difference is that the version
+number is different (1) and the ``e_id`` field is only stored for named
+user and group ACLs.
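
Note on attributes.rst above: the in-inode space arithmetic, the ``struct ext4_xattr_entry`` layout and the name-index prefix table can be tied together in a small sketch. Everything here is an assumption built from the tables in the new document, not kernel code: the helper names are hypothetical, the longest-prefix rule in `split_key` is one plausible reading of "the beginning of the key string is matched against the attribute name index", and the end-of-list test follows the document's "first four fields are zero" wording.

```python
import struct

# Name index -> key prefix, copied from the "Attribute Name Indices" table.
NAME_INDEX_PREFIXES = {
    1: "user.",
    2: "system.posix_acl_access",
    3: "system.posix_acl_default",
    4: "trusted.",
    6: "security.",
    7: "system.",
    8: "system.richacl",
}

# Fixed part of struct ext4_xattr_entry, little-endian, no padding:
# e_name_len (u8), e_name_index (u8), e_value_offs (le16),
# e_value_inum (le32), e_value_size (le32), e_hash (le32) = 16 bytes.
ENTRY_FORMAT = "<BBHIII"
ENTRY_FIXED_SIZE = struct.calcsize(ENTRY_FORMAT)


def in_inode_xattr_space(inode_size, i_extra_isize):
    """Bytes left after the 128-byte base inode and its extra fields."""
    return inode_size - (128 + i_extra_isize)


def split_key(key):
    """Return (name_index, stored_name); assumes the longest prefix wins."""
    by_length = sorted(NAME_INDEX_PREFIXES.items(),
                       key=lambda item: len(item[1]), reverse=True)
    for index, prefix in by_length:
        if key.startswith(prefix):
            return index, key[len(prefix):]
    return 0, key  # index 0 means "no prefix"


def parse_entry(buf, offset=0):
    """Decode one entry at offset; return None at the end-of-list marker."""
    (name_len, name_index, value_offs,
     value_inum, value_size, e_hash) = struct.unpack_from(ENTRY_FORMAT, buf, offset)
    if (name_len, name_index, value_offs, value_inum) == (0, 0, 0, 0):
        return None
    start = offset + ENTRY_FIXED_SIZE
    name = bytes(buf[start:start + name_len]).decode("ascii")
    return {
        "key": NAME_INDEX_PREFIXES.get(name_index, "") + name,
        "value_offs": value_offs,
        "value_inum": value_inum,
        "value_size": value_size,
        "hash": e_hash,
    }


if __name__ == "__main__":
    print(in_inode_xattr_space(256, 28))
    print(split_key("user.fubar"))
    raw = struct.pack(ENTRY_FORMAT, 5, 1, 0x40, 0, 3, 0) + b"fubar"
    print(parse_entry(raw))
```

The sketch decodes a single entry only; walking a full entry table would also need the padding rule between entries, which the text above does not spell out.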
diff --git a/Documentation/filesystems/ext4/ondisk/bigalloc.rst b/Documentation/filesystems/ext4/ondisk/bigalloc.rst
new file mode 100644
index 000000000000..c6d88557553c
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/bigalloc.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Bigalloc
+--------
+
+At the moment, the default size of a block is 4KiB, which is a commonly
+supported page size on most MMU-capable hardware. This is fortunate, as
+ext4 code is not prepared to handle the case where the block size
+exceeds the page size. However, for a filesystem of mostly huge files,
+it is desirable to be able to allocate disk blocks in units of multiple
+blocks to reduce both fragmentation and metadata overhead. The
+`bigalloc <Bigalloc>`__ feature provides exactly this ability. The
+administrator can set a block cluster size at mkfs time (which is stored
+in the s\_log\_cluster\_size field in the superblock); from then on, the
+block bitmaps track clusters, not individual blocks. This means that
+block groups can be several gigabytes in size (instead of just 128MiB);
+however, the minimum allocation unit becomes a cluster, not a block,
+even for directories. TaoBao had a patchset to extend the “use units of
+clusters instead of blocks” to the extent tree, though it is not clear
+where those patches went-- they eventually morphed into “extent tree v2”
+but that code has not landed as of May 2015.
+
diff --git a/Documentation/filesystems/ext4/ondisk/bitmaps.rst b/Documentation/filesystems/ext4/ondisk/bitmaps.rst
new file mode 100644
index 000000000000..c7546dbc197a
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/bitmaps.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Block and inode Bitmaps
+-----------------------
+
+The data block bitmap tracks the usage of data blocks within the block
+group.
+
+The inode bitmap records which entries in the inode table are in use.
+
+As with most bitmaps, one bit represents the usage status of one data
+block or inode table entry. This implies a block group size of 8 \*
+number\_of\_bytes\_in\_a\_logical\_block.
+
+NOTE: If ``BLOCK_UNINIT`` is set for a given block group, various parts
+of the kernel and e2fsprogs code pretend that the block bitmap contains
+zeros (i.e. all blocks in the group are free). However, it is not
+necessarily the case that no blocks are in use -- if ``meta_bg`` is set,
+the bitmaps and group descriptor live inside the group. Unfortunately,
+ext2fs\_test\_block\_bitmap2() will return '0' for those locations,
+which produces confusing debugfs output.
+
+Inode Table
+-----------
+Inode tables are statically allocated at mkfs time.  Each block group
+descriptor points to the start of the table, and the superblock records
+the number of inodes per group.  See the section on inodes for more
+information.
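
Note on bitmaps.rst above: the statement that one bit per block "implies a block group size of 8 * number_of_bytes_in_a_logical_block" can be worked through as follows. This is a sketch that assumes the block bitmap occupies exactly one logical block and that bigalloc is not in use; the function names are illustrative.

```python
def blocks_per_group(block_size: int) -> int:
    """One bitmap block holds 8 bits per byte, one bit per tracked block."""
    return 8 * block_size


def block_group_bytes(block_size: int) -> int:
    """Total bytes covered by one block group."""
    return blocks_per_group(block_size) * block_size


if __name__ == "__main__":
    size = 4096
    print(blocks_per_group(size), block_group_bytes(size) // (1024 * 1024), "MiB")
```

For a 4KiB block this gives 32768 blocks per group, or 128MiB, matching the block group size quoted in allocators.rst and bigalloc.rst.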
diff --git a/Documentation/filesystems/ext4/ondisk/blockgroup.rst b/Documentation/filesystems/ext4/ondisk/blockgroup.rst
new file mode 100644
index 000000000000..baf888e4c06a
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/blockgroup.rst
@@ -0,0 +1,135 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Layout
+------
+
+The layout of a standard block group is a