summaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2018-08-13 22:34:47 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2018-08-13 22:34:47 -0700
commit10f3e23f07cb0c20f9bcb77a5b5a7eb2a1b2a2fe (patch)
tree1fcb34309b3542512c6f3345f092f7adb8c3312c /Documentation
parent3bb37da509e576c80180fa0e4d1cfcaddf0cb82e (diff)
parent863c37fcb14f8b66ea831b45fb35a53ac4a8d69e (diff)
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o: - Convert content from the ext4 wiki to Documentation rst files so it is more likely to be updated as we add new features to ext4. - Add 64-bit timestamp support to ext4's superblock fields. - ... and the usual bug fixes and cleanups, including a Spectre gadget fixup and some hardening against maliciously corrupted file systems. * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (34 commits) ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa() ext4: improve code readability in ext4_iget() ext4: fix spectre gadget in ext4_mb_regular_allocator() ext4: check for NUL characters in extended attribute's name ext4: use ext4_warning() for sb_getblk failure ext4: fix race when setting the bitmap corrupted flag ext4: reset error code in ext4_find_entry in fallback ext4: handle layout changes to pinned DAX mappings dax: dax_layout_busy_page() warn on !exceptional docs: fix up the obviously obsolete bits in the new ext4 documentation docs: add new ext4 superblock time extension fields docs: create filesystem internal section ext4: use swap macro in mext_page_double_lock ext4: check allocation failure when duplicating "data" in ext4_remount() ext4: fix warning message in ext4_enable_quotas() ext4: super: extend timestamps to 40 bits jbd2: replace current_kernel_time64 with ktime equivalent ext4: use timespec64 for all inode times ext4: use ktime_get_real_seconds for i_dtime ext4: use 64-bit timestamps for mmp_time ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/conf.py2
-rw-r--r--Documentation/filesystems/ext4/ext4.rst (renamed from Documentation/filesystems/ext4.txt)142
-rw-r--r--Documentation/filesystems/ext4/index.rst17
-rw-r--r--Documentation/filesystems/ext4/ondisk/about.rst44
-rw-r--r--Documentation/filesystems/ext4/ondisk/allocators.rst56
-rw-r--r--Documentation/filesystems/ext4/ondisk/attributes.rst191
-rw-r--r--Documentation/filesystems/ext4/ondisk/bigalloc.rst22
-rw-r--r--Documentation/filesystems/ext4/ondisk/bitmaps.rst28
-rw-r--r--Documentation/filesystems/ext4/ondisk/blockgroup.rst135
-rw-r--r--Documentation/filesystems/ext4/ondisk/blockmap.rst49
-rw-r--r--Documentation/filesystems/ext4/ondisk/blocks.rst142
-rw-r--r--Documentation/filesystems/ext4/ondisk/checksums.rst73
-rw-r--r--Documentation/filesystems/ext4/ondisk/directory.rst426
-rw-r--r--Documentation/filesystems/ext4/ondisk/dynamic.rst12
-rw-r--r--Documentation/filesystems/ext4/ondisk/eainode.rst18
-rw-r--r--Documentation/filesystems/ext4/ondisk/globals.rst13
-rw-r--r--Documentation/filesystems/ext4/ondisk/group_descr.rst170
-rw-r--r--Documentation/filesystems/ext4/ondisk/ifork.rst194
-rw-r--r--Documentation/filesystems/ext4/ondisk/index.rst9
-rw-r--r--Documentation/filesystems/ext4/ondisk/inlinedata.rst37
-rw-r--r--Documentation/filesystems/ext4/ondisk/inodes.rst575
-rw-r--r--Documentation/filesystems/ext4/ondisk/journal.rst611
-rw-r--r--Documentation/filesystems/ext4/ondisk/mmp.rst77
-rw-r--r--Documentation/filesystems/ext4/ondisk/overview.rst26
-rw-r--r--Documentation/filesystems/ext4/ondisk/special_inodes.rst38
-rw-r--r--Documentation/filesystems/ext4/ondisk/super.rst801
-rw-r--r--Documentation/index.rst11
27 files changed, 3840 insertions, 79 deletions
diff --git a/Documentation/conf.py b/Documentation/conf.py
index 62ac5a9f3a9f..b691af4831fa 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -34,7 +34,7 @@ needs_sphinx = '1.3'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
-extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure']
+extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig']
# The name of the math extension changed on Sphinx 1.4
if major == 1 and minor > 3:
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4/ext4.rst
index 7f628b9f7c4b..9d4368d591fa 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4/ext4.rst
@@ -1,6 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
-Ext4 Filesystem
-===============
+========================
+General Information
+========================
Ext4 is an advanced level of the ext3 filesystem which incorporates
scalability and reliability enhancements for supporting large filesystems
@@ -11,37 +13,30 @@ Mailing list: linux-ext4@vger.kernel.org
Web site: http://ext4.wiki.kernel.org
-1. Quick usage instructions:
-===========================
+Quick usage instructions
+========================
Note: More extensive information for getting started with ext4 can be
- found at the ext4 wiki site at the URL:
- http://ext4.wiki.kernel.org/index.php/Ext4_Howto
+found at the ext4 wiki site at the URL:
+http://ext4.wiki.kernel.org/index.php/Ext4_Howto
- - Compile and install the latest version of e2fsprogs (as of this
- writing version 1.41.3) from:
+ - The latest version of e2fsprogs can be found at:
+
+ https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
- http://sourceforge.net/project/showfiles.php?group_id=2406
-
or
- https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+ http://sourceforge.net/project/showfiles.php?group_id=2406
or grab the latest git repository from:
- git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
-
- - Note that it is highly important to install the mke2fs.conf file
- that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
- you have edited the /etc/mke2fs.conf file installed on your system,
- you will need to merge your changes with the version from e2fsprogs
- 1.41.x.
+ https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
- Create a new filesystem using the ext4 filesystem type:
- # mke2fs -t ext4 /dev/hda1
+ # mke2fs -t ext4 /dev/hda1
- Or to configure an existing ext3 filesystem to support extents:
+ Or to configure an existing ext3 filesystem to support extents:
# tune2fs -O extents /dev/hda1
@@ -50,10 +45,6 @@ Note: More extensive information for getting started with ext4 can be
# tune2fs -I 256 /dev/hda1
- (Note: we currently do not have tools to convert an ext4
- filesystem back to ext3; so please do not do try this on production
- filesystems.)
-
- Mounting:
# mount -t ext4 /dev/hda1 /wherever
@@ -75,10 +66,11 @@ Note: More extensive information for getting started with ext4 can be
the filesystem with a large journal can also be helpful for
metadata-intensive workloads.
-2. Features
-===========
+Features
+========
-2.1 Currently available
+Currently Available
+-------------------
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
@@ -103,31 +95,15 @@ Note: More extensive information for getting started with ext4 can be
[1] Filesystems with a block size of 1k may see a limit imposed by the
directory hash tree having a maximum depth of two.
-2.2 Candidate features for future inclusion
-
-* online defrag (patches available but not well tested)
-* reduced mke2fs time via lazy itable initialization in conjunction with
- the uninit_bg feature (capability to do this is available in e2fsprogs
- but a kernel thread to do lazy zeroing of unused inode table blocks
- after filesystem is first mounted is required for safety)
-
-There are several others under discussion, whether they all make it in is
-partly a function of how much time everyone has to work on them. Features like
-metadata checksumming have been discussed and planned for a bit but no patches
-exist yet so I'm not sure they're in the near-term roadmap.
-
-The big performance win will come with mballoc, delalloc and flex_bg
-grouping of bitmaps and inode tables. Some test results available here:
-
- - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
- - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
-
-3. Options
-==========
+Options
+=======
When mounting an ext4 filesystem, the following option are accepted:
(*) == default
+======================= =======================================================
+Mount Option Description
+======================= =======================================================
ro Mount filesystem read only. Note that ext4 will
replay the journal (and thus write to the
partition) even when mounted "read only". The
@@ -387,33 +363,38 @@ i_version Enable 64-bit inode version support. This option is
dax Use direct access (no page cache). See
Documentation/filesystems/dax.txt. Note that
this option is incompatible with data=journal.
+======================= =======================================================
Data Mode
=========
There are 3 different data modes:
* writeback mode
-In data=writeback mode, ext4 does not journal data at all. This mode provides
-a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
-mode - metadata journaling. A crash+recovery can cause incorrect data to
-appear in files which were written shortly before the crash. This mode will
-typically provide the best ext4 performance.
+
+ In data=writeback mode, ext4 does not journal data at all. This mode provides
+ a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+ mode - metadata journaling. A crash+recovery can cause incorrect data to
+ appear in files which were written shortly before the crash. This mode will
+ typically provide the best ext4 performance.
* ordered mode
-In data=ordered mode, ext4 only officially journals metadata, but it logically
-groups metadata information related to data changes with the data blocks into a
-single unit called a transaction. When it's time to write the new metadata
-out to disk, the associated data blocks are written first. In general,
-this mode performs slightly slower than writeback but significantly faster than journal mode.
+
+ In data=ordered mode, ext4 only officially journals metadata, but it logically
+ groups metadata information related to data changes with the data blocks into
+ a single unit called a transaction. When it's time to write the new metadata
+ out to disk, the associated data blocks are written first. In general, this
+ mode performs slightly slower than writeback but significantly faster than
+ journal mode.
* journal mode
-data=journal mode provides full data and metadata journaling. All new data is
-written to the journal first, and then to its final location.
-In the event of a crash, the journal can be replayed, bringing both data and
-metadata into a consistent state. This mode is the slowest except when data
-needs to be read from and written to disk at the same time where it
-outperforms all others modes. Enabling this mode will disable delayed
-allocation and O_DIRECT support.
+
+ data=journal mode provides full data and metadata journaling. All new data is
+ written to the journal first, and then to its final location. In the event of
+ a crash, the journal can be replayed, bringing both data and metadata into a
+ consistent state. This mode is the slowest except when data needs to be read
+ from and written to disk at the same time where it outperforms all others
+ modes. Enabling this mode will disable delayed allocation and O_DIRECT
+ support.
/proc entries
=============
@@ -425,10 +406,12 @@ Information about mounted ext4 file systems can be found in
in table below.
Files in /proc/fs/ext4/<devname>
-..............................................................................
+
+================ =======
File Content
+================ =======
mb_groups details of multiblock allocator buddy cache of free blocks
-..............................................................................
+================ =======
/sys entries
============
@@ -439,28 +422,30 @@ Information about mounted ext4 file systems can be found in
/sys/fs/ext4/dm-0). The files in each per-device directory are shown
in table below.
-Files in /sys/fs/ext4/<devname>
+Files in /sys/fs/ext4/<devname>:
+
(see also Documentation/ABI/testing/sysfs-fs-ext4)
-..............................................................................
- File Content
+============================= =================================================
+File Content
+============================= =================================================
delayed_allocation_blocks This file is read-only and shows the number of
blocks that are dirty in the page cache, but
which do not have their location in the
filesystem allocated yet.
- inode_goal Tuning parameter which (if non-zero) controls
+inode_goal Tuning parameter which (if non-zero) controls
the goal inode used by the inode allocator in
preference to all other allocation heuristics.
This is intended for debugging use only, and
should be 0 on production systems.
- inode_readahead_blks Tuning parameter which controls the maximum
+inode_readahead_blks Tuning parameter which controls the maximum
number of inode table blocks that ext4's inode
table readahead algorithm will pre-read into
the buffer cache
- lifetime_write_kbytes This file is read-only and shows the number of
+lifetime_write_kbytes This file is read-only and shows the number of
kilobytes of data that have been written to this
filesystem since it was created.
@@ -508,7 +493,7 @@ Files in /sys/fs/ext4/<devname>
in the file system. If there is not enough space
for the reserved space when mounting the file
mount will _not_ fail.
-..............................................................................
+============================= =================================================
Ioctls
======
@@ -518,8 +503,10 @@ through the system call interfaces. The list of all Ext4 specific ioctls are
shown in the table below.
Table of Ext4 specific ioctls
-..............................................................................
- Ioctl Description
+
+============================= =================================================
+Ioctl Description
+============================= =================================================
EXT4_IOC_GETFLAGS Get additional attributes associated with inode.
The ioctl argument is an integer bitfield, with
bit values described in ext4.h. This ioctl is an
@@ -610,8 +597,7 @@ Table of Ext4 specific ioctls
normal user by accident.
The data blocks of the previous boot loader
will be associated with the given inode.
-
-..............................................................................
+============================= =================================================
References
==========
diff --git a/Documentation/filesystems/ext4/index.rst b/Documentation/filesystems/ext4/index.rst
new file mode 100644
index 000000000000..71121605558c
--- /dev/null
+++ b/Documentation/filesystems/ext4/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+ext4 Filesystem
+===============
+
+General usage and on-disk artifacts writen by ext4. More documentation may
+be ported from the wiki as time permits. This should be considered the
+canonical source of information as the details here have been reviewed by
+the ext4 community.
+
+.. toctree::
+ :maxdepth: 5
+ :numbered:
+
+ ext4
+ ondisk/index
diff --git a/Documentation/filesystems/ext4/ondisk/about.rst b/Documentation/filesystems/ext4/ondisk/about.rst
new file mode 100644
index 000000000000..0aadba052264
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/about.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+About this Book
+===============
+
+This document attempts to describe the on-disk format for ext4
+filesystems. The same general ideas should apply to ext2/3 filesystems
+as well, though they do not support all the features that ext4 supports,
+and the fields will be shorter.
+
+**NOTE**: This is a work in progress, based on notes that the author
+(djwong) made while picking apart a filesystem by hand. The data
+structure definitions should be current as of Linux 4.18 and
+e2fsprogs-1.44. All comments and corrections are welcome, since there is
+undoubtedly plenty of lore that might not be reflected in freshly
+created demonstration filesystems.
+
+License
+-------
+This book is licensed under the terms of the GNU Public License, v2.
+
+Terminology
+-----------
+
+ext4 divides a storage device into an array of logical blocks both to
+reduce bookkeeping overhead and to increase throughput by forcing larger
+transfer sizes. Generally, the block size will be 4KiB (the same size as
+pages on x86 and the block layer's default block size), though the
+actual size is calculated as 2 ^ (10 + ``sb.s_log_block_size``) bytes.
+Throughout this document, disk locations are given in terms of these
+logical blocks, not raw LBAs, and not 1024-byte blocks. For the sake of
+convenience, the logical block size will be referred to as
+``$block_size`` throughout the rest of the document.
+
+When referenced in ``preformatted text`` blocks, ``sb`` refers to fields
+in the super block, and ``inode`` refers to fields in an inode table
+entry.
+
+Other References
+----------------
+
+Also see http://www.nongnu.org/ext2-doc/ for quite a collection of
+information about ext2/3. Here's another old reference:
+http://wiki.osdev.org/Ext2
diff --git a/Documentation/filesystems/ext4/ondisk/allocators.rst b/Documentation/filesystems/ext4/ondisk/allocators.rst
new file mode 100644
index 000000000000..7aa85152ace3
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/allocators.rst
@@ -0,0 +1,56 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Block and Inode Allocation Policy
+---------------------------------
+
+ext4 recognizes (better than ext3, anyway) that data locality is
+generally a desirably quality of a filesystem. On a spinning disk,
+keeping related blocks near each other reduces the amount of movement
+that the head actuator and disk must perform to access a data block,
+thus speeding up disk IO. On an SSD there of course are no moving parts,
+but locality can increase the size of each transfer request while
+reducing the total number of requests. This locality may also have the
+effect of concentrating writes on a single erase block, which can speed
+up file rewrites significantly. Therefore, it is useful to reduce
+fragmentation whenever possible.
+
+The first tool that ext4 uses to combat fragmentation is the multi-block
+allocator. When a file is first created, the block allocator
+speculatively allocates 8KiB of disk space to the file on the assumption
+that the space will get written soon. When the file is closed, the
+unused speculative allocations are of course freed, but if the
+speculation is correct (typically the case for full writes of small
+files) then the file data gets written out in a single multi-block
+extent. A second related trick that ext4 uses is delayed allocation.
+Under this scheme, when a file needs more blocks to absorb file writes,
+the filesystem defers deciding the exact placement on the disk until all
+the dirty buffers are being written out to disk. By not committing to a
+particular placement until it's absolutely necessary (the commit timeout
+is hit, or sync() is called, or the kernel runs out of memory), the hope
+is that the filesystem can make better location decisions.
+
+The third trick that ext4 (and ext3) uses is that it tries to keep a
+file's data blocks in the same block group as its inode. This cuts down
+on the seek penalty when the filesystem first has to read a file's inode
+to learn where the file's data blocks live and then seek over to the
+file's data blocks to begin I/O operations.
+
+The fourth trick is that all the inodes in a directory are placed in the
+same block group as the directory, when feasible. The working assumption
+here is that all the files in a directory might be related, therefore it
+is useful to try to keep them all together.
+
+The fifth trick is that the disk volume is cut up into 128MB block
+groups; these mini-containers are used as outlined above to try to
+maintain data locality. However, there is a deliberate quirk -- when a
+directory is created in the root directory, the inode allocator scans
+the block groups and puts that directory into the least heavily loaded
+block group that it can find. This encourages directories to spread out
+over a disk; as the top-level directory/file blobs fill up one block
+group, the allocators simply move on to the next block group. Allegedly
+this scheme evens out the loading on the block groups, though the author
+suspects that the directories which are so unlucky as to land towards
+the end of a spinning drive get a raw deal performance-wise.
+
+Of course if all of these mechanisms fail, one can always use e4defrag
+to defragment files.
diff --git a/Documentation/filesystems/ext4/ondisk/attributes.rst b/Documentation/filesystems/ext4/ondisk/attributes.rst
new file mode 100644
index 000000000000..0b01b67b81fe
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/attributes.rst
@@ -0,0 +1,191 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Extended Attributes
+-------------------
+
+Extended attributes (xattrs) are typically stored in a separate data
+block on the disk and referenced from inodes via ``inode.i_file_acl*``.
+The first use of extended attributes seems to have been for storing file
+ACLs and other security data (selinux). With the ``user_xattr`` mount
+option it is possible for users to store extended attributes so long as
+all attribute names begin with “user”; this restriction seems to have
+disappeared as of Linux 3.0.
+
+There are two places where extended attributes can be found. The first
+place is between the end of each inode entry and the beginning of the
+next inode entry. For example, if inode.i\_extra\_isize = 28 and
+sb.inode\_size = 256, then there are 256 - (128 + 28) = 100 bytes
+available for in-inode extended attribute storage. The second place
+where extended attributes can be found is in the block pointed to by
+``inode.i_file_acl``. As of Linux 3.11, it is not possible for this
+block to contain a pointer to a second extended attribute block (or even
+the remaining blocks of a cluster). In theory it is possible for each
+attribute's value to be stored in a separate data block, though as of
+Linux 3.11 the code does not permit this.
+
+Keys are generally assumed to be ASCIIZ strings, whereas values can be
+strings or binary data.
+
+Extended attributes, when stored after the inode, have a header
+``ext4_xattr_ibody_header`` that is 4 bytes long:
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - \_\_le32
+ - h\_magic
+ - Magic number for identification, 0xEA020000. This value is set by the
+ Linux driver, though e2fsprogs doesn't seem to check it(?)
+
+The beginning of an extended attribute block is in
+``struct ext4_xattr_header``, which is 32 bytes long:
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - \_\_le32
+ - h\_magic
+ - Magic number for identification, 0xEA020000.
+ * - 0x4
+ - \_\_le32
+ - h\_refcount
+ - Reference count.
+ * - 0x8
+ - \_\_le32
+ - h\_blocks
+ - Number of disk blocks used.
+ * - 0xC
+ - \_\_le32
+ - h\_hash
+ - Hash value of all attributes.
+ * - 0x10
+ - \_\_le32
+ - h\_checksum
+ - Checksum of the extended attribute block.
+ * - 0x14
+ - \_\_u32
+ - h\_reserved[2]
+ - Zero.
+
+The checksum is calculated against the FS UUID, the 64-bit block number
+of the extended attribute block, and the entire block (header +
+entries).
+
+Following the ``struct ext4_xattr_header`` or
+``struct ext4_xattr_ibody_header`` is an array of
+``struct ext4_xattr_entry``; each of these entries is at least 16 bytes
+long. When stored in an external block, the ``struct ext4_xattr_entry``
+entries must be stored in sorted order. The sort order is
+``e_name_index``, then ``e_name_len``, and finally ``e_name``.
+Attributes stored inside an inode do not need be stored in sorted order.
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - \_\_u8
+ - e\_name\_len
+ - Length of name.
+ * - 0x1
+ - \_\_u8
+ - e\_name\_index
+ - Attribute name index. There is a discussion of this below.
+ * - 0x2
+ - \_\_le16
+ - e\_value\_offs
+ - Location of this attribute's value on the disk block where it is stored.
+ Multiple attributes can share the same value. For an inode attribute
+ this value is relative to the start of the first entry; for a block this
+ value is relative to the start of the block (i.e. the header).
+ * - 0x4
+ - \_\_le32
+ - e\_value\_inum
+ - The inode where the value is stored. Zero indicates the value is in the
+ same block as this entry. This field is only used if the
+ INCOMPAT\_EA\_INODE feature is enabled.
+ * - 0x8
+ - \_\_le32
+ - e\_value\_size
+ - Length of attribute value.
+ * - 0xC
+ - \_\_le32
+ - e\_hash
+ - Hash value of attribute name and attribute value. The kernel doesn't
+ update the hash for in-inode attributes, so for that case this value
+ must be zero, because e2fsck validates any non-zero hash regardless of
+ where the xattr lives.
+ * - 0x10
+ - char
+ - e\_name[e\_name\_len]
+ - Attribute name. Does not include trailing NULL.
+
+Attribute values can follow the end of the entry table. There appears to
+be a requirement that they be aligned to 4-byte boundaries. The values
+are stored starting at the end of the block and grow towards the
+xattr\_header/xattr\_entry table. When the two collide, the overflow is
+put into a separate disk block. If the disk block fills up, the
+filesystem returns -ENOSPC.
+
+The first four fields of the ``ext4_xattr_entry`` are set to zero to
+mark the end of the key list.
+
+Attribute Name Indices
+~~~~~~~~~~~~~~~~~~~~~~
+
+Logically speaking, extended attributes are a series of key=value pairs.
+The keys are assumed to be NULL-terminated strings. To reduce the amount
+of on-disk space that the keys consume, the beginning of the key string
+is matched against the attribute name index. If a match is found, the
+attribute name index field is set, and matching string is removed from
+the key name. Here is a map of name index values to key prefixes:
+
+.. list-table::
+ :widths: 1 79
+ :header-rows: 1
+
+ * - Name Index
+ - Key Prefix
+ * - 0
+ - (no prefix)
+ * - 1
+ - “user.”
+ * - 2
+ - “system.posix\_acl\_access”
+ * - 3
+ - “system.posix\_acl\_default”
+ * - 4
+ - “trusted.”
+ * - 6
+ - “security.”
+ * - 7
+ - “system.” (inline\_data only?)
+ * - 8
+ - “system.richacl” (SuSE kernels only?)
+
+For example, if the attribute key is “user.fubar”, the attribute name
+index is set to 1 and the “fubar” name is recorded on disk.
+
+POSIX ACLs
+~~~~~~~~~~
+
+POSIX ACLs are stored in a reduced version of the Linux kernel (and
+libacl's) internal ACL format. The key difference is that the version
+number is different (1) and the ``e_id`` field is only stored for named
+user and group ACLs.
diff --git a/Documentation/filesystems/ext4/ondisk/bigalloc.rst b/Documentation/filesystems/ext4/ondisk/bigalloc.rst
new file mode 100644
index 000000000000..c6d88557553c
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/bigalloc.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Bigalloc
+--------
+
+At the moment, the default size of a block is 4KiB, which is a commonly
+supported page size on most MMU-capable hardware. This is fortunate, as
+ext4 code is not prepared to handle the case where the block size
+exceeds the page size. However, for a filesystem of mostly huge files,
+it is desirable to be able to allocate disk blocks in units of multiple
+blocks to reduce both fragmentation and metadata overhead. The
+`bigalloc <Bigalloc>`__ feature provides exactly this ability. The
+administrator can set a block cluster size at mkfs time (which is stored
+in the s\_log\_cluster\_size field in the superblock); from then on, the
+block bitmaps track clusters, not individual blocks. This means that
+block groups can be several gigabytes in size (instead of just 128MiB);
+however, the minimum allocation unit becomes a cluster, not a block,
+even for directories. TaoBao had a patchset to extend the “use units of
+clusters instead of blocks” to the extent tree, though it is not clear
+where those patches went-- they eventually morphed into “extent tree v2”
+but that code has not landed as of May 2015.
+
diff --git a/Documentation/filesystems/ext4/ondisk/bitmaps.rst b/Documentation/filesystems/ext4/ondisk/bitmaps.rst
new file mode 100644
index 000000000000..c7546dbc197a
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/bitmaps.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Block and inode Bitmaps
+-----------------------
+
+The data block bitmap tracks the usage of data blocks within the block
+group.
+
+The inode bitmap records which entries in the inode table are in use.
+
+As with most bitmaps, one bit represents the usage status of one data
+block or inode table entry. This implies a block group size of 8 \*
+number\_of\_bytes\_in\_a\_logical\_block.
+
+NOTE: If ``BLOCK_UNINIT`` is set for a given block group, various parts
+of the kernel and e2fsprogs code pretends that the block bitmap contains
+zeros (i.e. all blocks in the group are free). However, it is not
+necessarily the case that no blocks are in use -- if ``meta_bg`` is set,
+the bitmaps and group descriptor live inside the group. Unfortunately,
+ext2fs\_test\_block\_bitmap2() will return '0' for those locations,
+which produces confusing debugfs output.
+
+Inode Table
+-----------
+Inode tables are statically allocated at mkfs time. Each block group
+descriptor points to the start of the table, and the superblock records
+the number of inodes per group. See the section on inodes for more
+information.
diff --git a/Documentation/filesystems/ext4/ondisk/blockgroup.rst b/Documentation/filesystems/ext4/ondisk/blockgroup.rst
new file mode 100644
index 000000000000..baf888e4c06a
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/blockgroup.rst
@@ -0,0 +1,135 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Layout
+------
+
+The layout of a standard block group is a