author Linus Torvalds <torvalds@linux-foundation.org> 2017-05-03 17:55:59 -0700
committer Linus Torvalds <torvalds@linux-foundation.org> 2017-05-03 17:55:59 -0700
commit dd23f273d9a765d7f092c1bb0d1cd7aaf668077e (patch)
tree 9bf826a9f553c9b0a5e852deaaf58bee56b601ac
parent 1684096b1ed813f621fb6cbd06e72235c1c2a0ca (diff)
parent b19385993623c1a18a686b6b271cd24d5aa96f52 (diff)
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things

 - most of MM

 - KASAN updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  kasan: separate report parts by empty lines
  kasan: improve double-free report format
  kasan: print page description after stacks
  kasan: improve slab object description
  kasan: change report header
  kasan: simplify address description logic
  kasan: change allocation and freeing stack traces headers
  kasan: unify report headers
  kasan: introduce helper functions for determining bug type
  mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
  mm: hwpoison: call shake_page() unconditionally
  mm/swapfile.c: fix swap space leak in error path of swap_free_entries()
  mm/gup.c: fix access_ok() argument type
  mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
  mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
  fs/block_dev: always invalidate cleancache in invalidate_bdev()
  fs: fix data invalidation in the cleancache during direct IO
  zram: reduce load operation in page_same_filled
  zram: use zram_free_page instead of open-coded
  zram: introduce zram data accessor
  ...
-rw-r--r-- Documentation/cgroup-v2.txt | 5
-rw-r--r-- Documentation/filesystems/proc.txt | 6
-rw-r--r-- Documentation/vm/00-INDEX | 2
-rw-r--r-- Documentation/vm/hugetlbfs_reserv.txt | 529
-rw-r--r-- arch/blackfin/mach-bf609/clock.c | 3
-rw-r--r-- drivers/block/zram/zram_drv.c | 577
-rw-r--r-- drivers/block/zram/zram_drv.h | 6
-rw-r--r-- drivers/tty/sysrq.c | 2
-rw-r--r-- fs/block_dev.c | 11
-rw-r--r-- fs/iomap.c | 18
-rw-r--r-- fs/jbd2/journal.c | 9
-rw-r--r-- fs/jbd2/transaction.c | 12
-rw-r--r-- fs/ocfs2/cluster/heartbeat.c | 8
-rw-r--r-- fs/ocfs2/cluster/tcp.c | 7
-rw-r--r-- fs/proc/task_mmu.c | 8
-rw-r--r-- fs/xfs/kmem.c | 12
-rw-r--r-- fs/xfs/kmem.h | 2
-rw-r--r-- fs/xfs/libxfs/xfs_btree.c | 2
-rw-r--r-- fs/xfs/xfs_aops.c | 6
-rw-r--r-- fs/xfs/xfs_buf.c | 8
-rw-r--r-- fs/xfs/xfs_trans.c | 12
-rw-r--r-- include/linux/gfp.h | 22
-rw-r--r-- include/linux/jbd2.h | 2
-rw-r--r-- include/linux/ksm.h | 5
-rw-r--r-- include/linux/memcontrol.h | 201
-rw-r--r-- include/linux/migrate.h | 5
-rw-r--r-- include/linux/mm.h | 1
-rw-r--r-- include/linux/mmzone.h | 7
-rw-r--r-- include/linux/rmap.h | 47
-rw-r--r-- include/linux/rodata_test.h | 1
-rw-r--r-- include/linux/sched.h | 6
-rw-r--r-- include/linux/sched/mm.h | 26
-rw-r--r-- include/linux/swap.h | 5
-rw-r--r-- include/linux/vm_event_item.h | 2
-rw-r--r-- kernel/locking/lockdep.c | 11
-rw-r--r-- lib/dma-debug.c | 8
-rw-r--r-- lib/radix-tree.c | 2
-rw-r--r-- mm/Kconfig.debug | 1
-rw-r--r-- mm/compaction.c | 6
-rw-r--r-- mm/filemap.c | 42
-rw-r--r-- mm/gup.c | 2
-rw-r--r-- mm/huge_memory.c | 12
-rw-r--r-- mm/hwpoison-inject.c | 3
-rw-r--r-- mm/internal.h | 17
-rw-r--r-- mm/kasan/kasan.c | 3
-rw-r--r-- mm/kasan/kasan.h | 2
-rw-r--r-- mm/kasan/report.c | 187
-rw-r--r-- mm/khugepaged.c | 12
-rw-r--r-- mm/ksm.c | 16
-rw-r--r-- mm/madvise.c | 56
-rw-r--r-- mm/memcontrol.c | 248
-rw-r--r-- mm/memory-failure.c | 79
-rw-r--r-- mm/memory_hotplug.c | 6
-rw-r--r-- mm/migrate.c | 10
-rw-r--r-- mm/mlock.c | 6
-rw-r--r-- mm/mmap.c | 2
-rw-r--r-- mm/oom_kill.c | 2
-rw-r--r-- mm/page-writeback.c | 15
-rw-r--r-- mm/page_alloc.c | 77
-rw-r--r-- mm/page_ext.c | 13
-rw-r--r-- mm/page_idle.c | 4
-rw-r--r-- mm/page_isolation.c | 6
-rw-r--r-- mm/page_poison.c | 77
-rw-r--r-- mm/rmap.c | 148
-rw-r--r-- mm/rodata_test.c | 17
-rw-r--r-- mm/slab.c | 7
-rw-r--r-- mm/sparse.c | 5
-rw-r--r-- mm/swap.c | 49
-rw-r--r-- mm/swap_slots.c | 4
-rw-r--r-- mm/swap_state.c | 12
-rw-r--r-- mm/swapfile.c | 35
-rw-r--r-- mm/truncate.c | 13
-rw-r--r-- mm/vmscan.c | 508
-rw-r--r-- mm/vmstat.c | 72
-rw-r--r-- mm/workingset.c | 6
-rw-r--r-- scripts/spelling.txt | 25
-rw-r--r-- tools/testing/selftests/vm/Makefile | 11
-rwxr-xr-x tools/testing/selftests/vm/run_vmtests | 6
-rw-r--r-- tools/testing/selftests/vm/userfaultfd.c | 207
79 files changed, 2141 insertions, 1484 deletions
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 49d7c997fa1e..e50b95c25868 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -871,6 +871,11 @@ PAGE_SIZE multiple when read back.
Amount of memory used in network transmission buffers
+ shmem
+
+ Amount of cached filesystem data that is swap-backed,
+ such as tmpfs, shm segments, shared anonymous mmap()s
+
file_mapped
Amount of cached filesystem data mapped with mmap()
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 9036dbf16156..4cddbce85ac9 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -413,6 +413,7 @@ Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 892 kB
Anonymous: 0 kB
+LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
@@ -442,6 +443,11 @@ accessed.
"Anonymous" shows the amount of memory that does not belong to any file. Even
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
+"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
+The memory isn't freed immediately with madvise(). It's freed under memory
+pressure if the memory is clean. Please note that the printed value might
+be lower than the real value due to optimizations used in the current
+implementation. If this is not desirable please file a bug report.
"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
"ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
huge pages.
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 6a5e2a102a45..11d3d8dcb449 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -12,6 +12,8 @@ highmem.txt
- Outline of highmem and common issues.
hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
+hugetlbfs_reserv.txt
+ - A brief overview of hugetlbfs reservation design/implementation.
hwpoison.txt
- explains what hwpoison is
idle_page_tracking.txt
diff --git a/Documentation/vm/hugetlbfs_reserv.txt b/Documentation/vm/hugetlbfs_reserv.txt
new file mode 100644
index 000000000000..9aca09a76bed
--- /dev/null
+++ b/Documentation/vm/hugetlbfs_reserv.txt
@@ -0,0 +1,529 @@
+Hugetlbfs Reservation Overview
+------------------------------
+Huge pages as described at 'Documentation/vm/hugetlbpage.txt' are typically
+preallocated for application use. These huge pages are instantiated in a
+task's address space at page fault time if the VMA indicates huge pages are
+to be used. If no huge page exists at page fault time, the task is sent
+a SIGBUS and often dies an unhappy death. Shortly after huge page support
+was added, it was determined that it would be better to detect a shortage
+of huge pages at mmap() time. The idea is that if there were not enough
+huge pages to cover the mapping, the mmap() would fail. This was first
+done with a simple check in the code at mmap() time to determine if there
+were enough free huge pages to cover the mapping. Like most things in the
+kernel, the code has evolved over time. However, the basic idea was to
+'reserve' huge pages at mmap() time to ensure that huge pages would be
+available for page faults in that mapping. The description below attempts to
+describe how huge page reserve processing is done in the v4.10 kernel.
+
+
+Audience
+--------
+This description is primarily targeted at kernel developers who are modifying
+hugetlbfs code.
+
+
+The Data Structures
+-------------------
+resv_huge_pages
+ This is a global (per-hstate) count of reserved huge pages. Reserved
+ huge pages are only available to the task which reserved them.
+ Therefore, the number of huge pages generally available is computed
+ as (free_huge_pages - resv_huge_pages).
+Reserve Map
+ A reserve map is described by the structure:
+ struct resv_map {
+ struct kref refs;
+ spinlock_t lock;
+ struct list_head regions;
+ long adds_in_progress;
+ struct list_head region_cache;
+ long region_cache_count;
+ };
+ There is one reserve map for each huge page mapping in the system.
+ The regions list within the resv_map describes the regions within
+ the mapping. A region is described as:
+ struct file_region {
+ struct list_head link;
+ long from;
+ long to;
+ };
+ The 'from' and 'to' fields of the file region structure are huge page
+ indices into the mapping. Depending on the type of mapping, a
+ region in the resv_map may indicate reservations exist for the
+ range, or reservations do not exist.
+Flags for MAP_PRIVATE Reservations
+ These are stored in the bottom bits of the reservation map pointer.
+ #define HPAGE_RESV_OWNER (1UL << 0) Indicates this task is the
+ owner of the reservations associated with the mapping.
+ #define HPAGE_RESV_UNMAPPED (1UL << 1) Indicates task originally
+ mapping this range (and creating reserves) has unmapped a
+ page from this task (the child) due to a failed COW.
+Page Flags
+ The PagePrivate page flag is used to indicate that a huge page
+ reservation must be restored when the huge page is freed. More
+ details will be discussed in the "Freeing huge pages" section.
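
The "bottom bits of the reservation map pointer" idea can be illustrated
with a small userspace sketch. It is not the kernel code: the struct
resv_map_model type, the HPAGE_RESV_MASK name as used here, and the
pack/unpack helpers are assumptions made for illustration. Only the two
flag values come from the text above. The sketch relies on the map object
being aligned to at least 4 bytes, so the two low pointer bits are free.

```c
#include <stdint.h>

/* Flag values as given in the text above. */
#define HPAGE_RESV_OWNER    (1UL << 0)
#define HPAGE_RESV_UNMAPPED (1UL << 1)
/* Assumed helper mask covering both flag bits. */
#define HPAGE_RESV_MASK     (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

/* Stand-in for the kernel's struct resv_map; its contents do not matter here. */
struct resv_map_model {
	long placeholder;
};

/* Hypothetical helper: store the map pointer and flags in a single word. */
static uintptr_t pack_resv(struct resv_map_model *map, unsigned long flags)
{
	return (uintptr_t)map | (flags & HPAGE_RESV_MASK);
}

/* Hypothetical helper: recover the pointer by masking off the flag bits. */
static struct resv_map_model *unpack_map(uintptr_t word)
{
	return (struct resv_map_model *)(word & ~(uintptr_t)HPAGE_RESV_MASK);
}

/* Hypothetical helper: recover the flags from the low bits. */
static unsigned long unpack_flags(uintptr_t word)
{
	return (unsigned long)(word & HPAGE_RESV_MASK);
}
```

Packing the flags this way keeps the per-mapping state in one pointer-sized
field, at the cost of having to mask on every access.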
+
+
+Reservation Map Location (Private or Shared)
+--------------------------------------------
+A huge page mapping or segment is either private or shared. If private,
+it is typically only available to a single address space (task). If shared,
+it can be mapped into multiple address spaces (tasks). The location and
+semantics of the reservation map differ significantly between the two types
+of mappings. Location differences are:
+- For private mappings, the reservation map hangs off the VMA structure.
+ Specifically, vma->vm_private_data. This reserve map is created at the
+ time the mapping (mmap(MAP_PRIVATE)) is created.
+- For shared mappings, the reservation map hangs off the inode. Specifically,
+ inode->i_mapping->private_data. Since shared mappings are always backed
+ by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode
+ contains a reservation map. As a result, the reservation map is allocated
+ when the inode is created.
+
+
+Creating Reservations
+---------------------
+Reservations are created when a huge page backed shared memory segment is
+created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
+These operations result in a call to the routine hugetlb_reserve_pages()
+
+int hugetlb_reserve_pages(struct inode *inode,
+ long from, long to,
+ struct vm_area_struct *vma,
+ vm_flags_t vm_flags)
+
+The first thing hugetlb_reserve_pages() does is check whether the NORESERVE
+flag was specified in either the shmget() or mmap() call. If NORESERVE
+was specified, then this routine returns immediately as no reservations
+are desired.
+
+The arguments 'from' and 'to' are huge page indices into the mapping or
+underlying file. For shmget(), 'from' is always 0 and 'to' corresponds to
+the length of the segment/mapping. For mmap(), the offset argument could
+be used to specify the offset into the underlying file. In such a case
+the 'from' and 'to' arguments have been adjusted by this offset.
+
+One of the big differences between PRIVATE and SHARED mappings is the way
+in which reservations are represented in the reservation map.
+- For shared mappings, an entry in the reservation map indicates a reservation
+ exists or did exist for the corresponding page. As reservations are
+ consumed, the reservation map is not modified.
+- For private mappings, the lack of an entry in the reservation map indicates
+ a reservation exists for the corresponding page. As reservations are
+ consumed, entries are added to the reservation map. Therefore, the
+ reservation map can also be used to determine which reservations have
+ been consumed.
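
The opposite meaning of a map entry for shared and private mappings can be
sketched as follows. This is a userspace model, not the kernel code: the
region_model array stands in for the file_region list, the helper names are
hypothetical, and the sketch assumes a region covers the half-open index
range [from, to).

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct file_region: covers indices [from, to). */
struct region_model {
	long from;
	long to;
};

/* True if idx falls inside any region of the map. */
static bool map_has_entry(const struct region_model *regions, size_t n,
			  long idx)
{
	for (size_t i = 0; i < n; i++)
		if (idx >= regions[i].from && idx < regions[i].to)
			return true;
	return false;
}

/*
 * Shared mapping: an entry means a reservation exists (or did exist).
 * Private mapping: an entry means the reservation was consumed, so the
 * absence of an entry means a reservation is still available.
 */
static bool reservation_exists(const struct region_model *regions, size_t n,
			       long idx, bool shared)
{
	bool entry = map_has_entry(regions, n, idx);

	return shared ? entry : !entry;
}
```

With a single region covering indices 0 and 1, index 1 has a reservation in
a shared mapping but not in a private one, and index 5 is the reverse.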
+
+For private mappings, hugetlb_reserve_pages() creates the reservation map and
+hangs it off the VMA structure. In addition, the HPAGE_RESV_OWNER flag is set
+to indicate this VMA owns the reservations.
+
+The reservation map is consulted to determine how many huge page reservations
+are needed for the current mapping/segment. For private mappings, this is
+always the value (to - from). However, for shared mappings it is possible that
+some reservations may already exist within the range (to - from). See the
+section "Reservation Map Modifications" for details on how this is accomplished.
+
+The mapping may be associated with a subpool. If so, the subpool is consulted
+to ensure there is sufficient space for the mapping. It is possible that the
+subpool has set aside reservations that can be used for the mapping. See the
+section "Subpool Reservations" for more details.
+
+After consulting the reservation map and subpool, the number of needed new
+reservations is known. The routine hugetlb_acct_memory() is called to check
+for and take the requested number of reservations. hugetlb_acct_memory()
+calls into routines that potentially allocate and adjust surplus page counts.
+However, within those routines the code is simply checking to ensure there
+are enough free huge pages to accommodate the reservation. If there are,
+the global reservation count resv_huge_pages is adjusted something like the
+following.
+ if (resv_needed <= (free_huge_pages - resv_huge_pages))
+ resv_huge_pages += resv_needed;
+Note that the global lock hugetlb_lock is held when checking and adjusting
+these counters.
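
The counter check can be written out as a small userspace model. The
hstate_model struct and reserve_pages() are hypothetical names, and the
locking is omitted. The sketch assumes the number of unreserved pages is
(free_huge_pages - resv_huge_pages), as stated in the description of
resv_huge_pages in "The Data Structures".

```c
#include <stdbool.h>

/* Simplified model of the per-hstate counters named in the text. */
struct hstate_model {
	long free_huge_pages;
	long resv_huge_pages;
};

/*
 * Try to take resv_needed reservations. Succeeds only if enough free,
 * not yet reserved huge pages remain. The kernel holds hugetlb_lock
 * around the equivalent check; this single-threaded model does not.
 */
static bool reserve_pages(struct hstate_model *h, long resv_needed)
{
	long available = h->free_huge_pages - h->resv_huge_pages;

	if (resv_needed > available)
		return false;

	h->resv_huge_pages += resv_needed;
	return true;
}
```

With 10 free pages and 4 already reserved, a request for 6 succeeds and
brings the reservation count to 10; a further request for 1 then fails.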
+
+If there were enough free huge pages and the global count resv_huge_pages
+was adjusted, then the reservation map associated with the mapping is
+modified to reflect the reservations. In the case of a shared mapping, a
+file_region will exist that includes the range 'from' 'to'. For private
+mappings, no modifications are made to the reservation map as lack of an
+entry indicates a reservation exists.
+
+If hugetlb_reserve_pages() was successful, the global reservation count and
+reservation map associated with the mapping will be modified as required to
+ensure reservations exist for the range 'from' - 'to'.
+
+
+Consuming Reservations/Allocating a Huge Page
+---------------------------------------------
+Reservations are consumed when huge pages associated with the reservations
+are allocated and instantiated in the corresponding mapping. The allocation
+is performed within the routine alloc_huge_page().
+struct page *alloc_huge_page(struct vm_area_struct *vma,
+ unsigned long addr, int avoid_reserve)
+alloc_huge_page is passed a VMA pointer and a virtual address, so it can
+consult the reservation map to determine if a reservation exists. In addition,
+alloc_huge_page takes the argument avoid_reserve which indicates reserves
+should not be used even if it appears they have been set aside for the
+specified address. The avoid_reserve argument is most often used in the case
+of Copy on Write and Page Migration where additional copies of an existing
+page are being allocated.
+
+The helper routine vma_needs_reservation() is called to determine if a
+reservation exists for the address within the mapping(vma). See the section
+"Reservation Map Helper Routines" for detailed information on what this
+routine does. The value returned from vma_needs_reservation() is generally
+0 or 1. 0 if a reservation exists for the address, 1 if no reservation exists.
+If a reservation does not exist, and there is a subpool associated with the
+mapping the subpool is consulted to determine if it contains reservations.
+If the subpool contains reservations, one can be used for this allocation.
+However, in every case the avoid_reserve argument overrides the use of
+a reservation for the allocation. After determining whether a reservation
+exists and can be used for the allocation, the routine dequeue_huge_page_vma()
+is called. This routine takes two arguments related to reservations:
+- avoid_reserve, this is the same value/argument passed to alloc_huge_page()
+- chg, even though this argument is of type long only the values 0 or 1 are
+ passed to dequeue_huge_page_vma. If the value is 0, it indicates a
+ reservation exists (see the section "Memory Policy and Reservations" for
+ possible issues). If the value is 1, it indicates a reservation does not
+ exist and the page must be taken from the global free pool if possible.
+The free lists associated with the memory policy of the VMA are searched for
+a free page. If a page is found, the value free_huge_pages is decremented
+when the page is removed from the free list. If there was a reservation
+associated with the page, the following adjustments are made:
+ SetPagePrivate(page); /* Indicates allocating this page consumed
+ * a reservation, and if an error is
+ * encountered such that the page must be
+ * freed, the reservation will be restored. */
+ resv_huge_pages--; /* Decrement the global reservation count */
+Note, if no huge page can be found that satisfies the VMA's memory policy
+an attempt will be made to allocate one using the buddy allocator. This
+brings up the issue of surplus huge pages and overcommit which is beyond
+the scope of reservations. Even if a surplus page is allocated, the same
+reservation based adjustments as above will be made: SetPagePrivate(page) and
+resv_huge_pages--.
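
A minimal userspace sketch of the bookkeeping done when a page is taken
from the free list is given below. The struct and function names are
hypothetical, the reserve_consumed field stands in for the PagePrivate
flag, and memory policy, surplus pages and locking are left out. The rule
that an allocation without a usable reservation needs an unreserved free
page is an assumption drawn from the (free_huge_pages - resv_huge_pages)
formula in "The Data Structures".

```c
#include <stdbool.h>

/* Model of the global counters. */
struct pool_model {
	long free_huge_pages;
	long resv_huge_pages;
};

/* Model of a huge page; reserve_consumed stands in for PagePrivate. */
struct page_model {
	bool reserve_consumed;
};

/*
 * Take one page from the pool. has_reservation corresponds to a chg value
 * of 0, avoid_reserve to the argument of the same name. Returns false if
 * no suitable page is available.
 */
static bool dequeue_page(struct pool_model *pool, struct page_model *page,
			 bool has_reservation, bool avoid_reserve)
{
	bool use_reserve = has_reservation && !avoid_reserve;
	long unreserved = pool->free_huge_pages - pool->resv_huge_pages;

	if (pool->free_huge_pages <= 0)
		return false;
	/* Without a usable reservation, only unreserved pages may be taken. */
	if (!use_reserve && unreserved <= 0)
		return false;

	pool->free_huge_pages--;
	page->reserve_consumed = false;
	if (use_reserve) {
		page->reserve_consumed = true;	/* like SetPagePrivate(page) */
		pool->resv_huge_pages--;
	}
	return true;
}
```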
+
+After obtaining a new huge page, (page)->private is set to the value of
+the subpool associated with the page if it exists. This will be used for
+subpool accounting when the page is freed.
+
+The routine vma_commit_reservation() is then called to adjust the reserve
+map based on the consumption of the reservation. In general, this involves
+ensuring the page is represented within a file_region structure of the region
+map. For shared mappings where the reservation was present, an entry
+in the reserve map already existed so no change is made. However, if there
+was no reservation in a shared mapping or this was a private mapping a new
+entry must be created.
+
+It is possible that the reserve map could have been changed between the call
+to vma_needs_reservation() at the beginning of alloc_huge_page() and the
+call to vma_commit_reservation() after the page was allocated. This would
+be possible if hugetlb_reserve_pages was called for the same page in a shared
+mapping. In such cases, the reservation count and subpool free page count
+will be off by one. This rare condition can be identified by comparing the
+return value from vma_needs_reservation and vma_commit_reservation. If such
+a race is detected, the subpool and global reserve counts are adjusted to
+compensate. See the section "Reservation Map Helper Routines" for more
+information on these routines.
+
+
+Instantiate Huge Pages
+----------------------
+After huge page allocation, the page is typically added to the page tables
+of the allocating task. Before this, pages in a shared mapping are added
+to the page cache and pages in private mappings are added to an anonymous
+reverse mapping. In both cases, the PagePrivate flag is cleared. Therefore,
+when a huge page that has been instantiated is freed no adjustment is made
+to the global reservation count (resv_huge_pages).
+
+
+Freeing Huge Pages
+------------------
+Huge page freeing is performed by the routine free_huge_page(). This routine
+is the destructor for hugetlbfs compound pages. As a result, it is only
+passed a pointer to the page struct. When a huge page is freed, reservation
+accounting may need to be performed. This would be the case if the page was
+associated with a subpool that contained reserves, or the page is being freed
+on an error path where a global reserve count must be restored.
+
+The page->private field points to any subpool associated with the page.
+If the PagePrivate flag is set, it indicates the global reserve count should
+be adjusted (see the section "Consuming Reservations/Allocating a Huge Page"
+for information on how these are set).
+
+The routine first calls hugepage_subpool_put_pages() for the page. If this
+routine returns a value of 0 (which does not equal the value passed 1) it
+indicates reserves are associated with the subpool, and this newly freed page
+must be used to keep the number of subpool reserves above the minimum size.
+Therefore, the global resv_huge_pages counter is incremented in this case.
+
+If the PagePrivate flag was set in the page, the global resv_huge_pages counter
+will always be incremented.
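
The two conditions that restore a global reservation at free time can be
modelled as below. This is a sketch with hypothetical names, not the
kernel routine: the caller passes in the PagePrivate state and the value
returned by hugepage_subpool_put_pages() for one page. It assumes the
counter is incremented once per freed page even when both conditions hold.

```c
#include <stdbool.h>

/* Model of the global reservation counter. */
struct free_model {
	long resv_huge_pages;
};

/*
 * page_private_flag: PagePrivate was set on the page being freed.
 * subpool_put_result: value returned when one page is put back to the
 * subpool; 0 means the subpool keeps the page as one of its reserves.
 */
static void free_page_accounting(struct free_model *h, bool page_private_flag,
				 long subpool_put_result)
{
	bool restore_reserve = page_private_flag || subpool_put_result == 0;

	if (restore_reserve)
		h->resv_huge_pages++;
}
```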
+
+
+Subpool Reservations
+--------------------
+There is a struct hstate associated with each huge page size. The hstate
+tracks all huge pages of the specified size. A subpool represents a subset
+of pages within a hstate that is associated with a mounted hugetlbfs
+filesystem.
+
+When a hugetlbfs filesystem is mounted a min_size option can be specified
+which indicates the minimum number of huge pages required by the filesystem.
+If this option is specified, the number of huge pages corresponding to
+min_size are reserved for use by the filesystem. This number is tracked in
+the min_hpages field of a struct hugepage_subpool. At mount time,
+hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
+huge pages. If they can not be reserved, the mount fails.
+
+The routines hugepage_subpool_get/put_pages() are called when pages are
+obtained from or released back to a subpool. They perform all subpool
+accounting, and track any reservations associated with the subpool.
+hugepage_subpool_get/put_pages are passed the number of huge pages by which
+to adjust the subpool 'used page' count (up for get, down for put). Normally,
+they return the same value that was passed or an error if not enough pages
+exist in the subpool.
+
+However, if reserves are associated with the subpool a return value less
+than the passed value may be returned. This return value indicates the
+number of additional global pool adjustments which must be made. For example,
+suppose a subpool contains 3 reserved huge pages and someone asks for 5.
+The 3 reserved pages associated with the subpool can be used to satisfy part
+of the request. But, 2 pages must be obtained from the global pools. To
+relay this information to the caller, the value 2 is returned. The caller
+is then responsible for attempting to obtain the additional two pages from
+the global pools.
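
The worked example above (3 reserved pages, a request for 5, a return
value of 2) can be reproduced with a small model. The subpool_model struct,
its rsv_hpages field and subpool_get_pages() are names assumed for this
sketch. Maximum size accounting and error returns are left out.

```c
/* Model of a subpool that tracks only its remaining reserved pages. */
struct subpool_model {
	long rsv_hpages;
};

/*
 * Request delta pages from the subpool. The subpool's own reserves are
 * used first. The return value is the number of pages the caller must
 * still obtain from the global pools.
 */
static long subpool_get_pages(struct subpool_model *sp, long delta)
{
	long from_global;

	if (delta > sp->rsv_hpages) {
		/* Reserves cover only part of the request. */
		from_global = delta - sp->rsv_hpages;
		sp->rsv_hpages = 0;
	} else {
		/* Reserves cover the whole request. */
		from_global = 0;
		sp->rsv_hpages -= delta;
	}
	return from_global;
}
```

A request that fits entirely within the reserves returns 0, which tells
the caller that no global pool adjustment is needed.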
+
+
+COW and Reservations
+--------------------
+Since shared mappings all point to and use the same underlying pages, the
+biggest reservation concern for COW is private mappings. In this case,
+two tasks can be pointing at the same previously allocated page. One task
+attempts to write to the page, so a new page must be allocated so that each