Age | Commit message (Collapse) | Author |
|
swap_extent is used to map swap page offset to backing device's block
offset. For a continuous block range, one swap_extent is used and all
these swap_extents are managed in a linked list.
These swap_extents are used by map_swap_entry() during swap's read and
write path. To find out the backing device's block offset for a page
offset, the swap_extent list will be traversed linearly, with
curr_swap_extent being used as a cache to speed up the search.
This works well as long as swap_extents are not huge or when the number
of processes that access swap device are few, but when the swap device
has many extents and there are a number of processes accessing the swap
device concurrently, it can be a problem. On one of our servers, the
disk's remaining size is tight:
$df -h
Filesystem Size Used Avail Use% Mounted on
... ...
/dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4
When creating a 80G swapfile there, there are as many as 84656 swap
extents. The end result is, kernel spends abou 30% time in
map_swap_entry() and swap throughput is only 70MB/s.
As a comparison, when I used smaller sized swapfile, like 4G whose
swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and
map_swap_entry() is about 3%.
One downside of using rbtree for swap_extent is, 'struct rbtree' takes
24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more
for each swap_extent. For a swapfile that has 80k swap_extents, that
means 625KiB more memory consumed.
Test:
Since it's not possible to reboot that server, I can not test this patch
diretly there. Instead, I tested it on another server with NVMe disk.
I created a 20G swapfile on an NVMe backed XFS fs. By default, the
filesystem is quite clean and the created swapfile has only 2 extents.
Testing vanilla and this patch shows no obvious performance difference
when swapfile is not fragmented.
To see the patch's effects, I used some tweaks to manually fragment the
swapfile by breaking the extent at 1M boundary. This made the swapfile
have 20K extents.
nr_task=4
kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf)
vanilla 165191 90.77% 171798 90.21%
patched 858993 +420% 2.16% 715827 +317% 0.77%
nr_task=8
kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf)
vanilla 306783 92.19% 318145 87.76%
patched 954437 +211% 2.35% 1073741 +237% 1.57%
swapout: the throughput of swap out, in KB/s, higher is better 1st
map_swap_entry: cpu cycles percent sampled by perf swapin: the
throughput of swap in, in KB/s, higher is better. 2nd map_swap_entry:
cpu cycles percent sampled by perf
nr_task=1 doesn't show any difference, this is due to the curr_swap_extent
can be effectively used to cache the correct swap extent for single task
workload.
[akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
total_swapcache_pages() may race with swapper_spaces[] allocation and
freeing. Previously, this is protected with a swapper_spaces[] specific
RCU mechanism. To simplify the logic/code complexity, it is replaced with
get/put_swap_device(). The code line number is reduced too. Although not
so important, the swapoff() performance improves too because one
synchronize_rcu() call during swapoff() is deleted.
[ying.huang@intel.com: fix bad swap file entry warning]
Link: http://lkml.kernel.org/r/20190531024102.21723-1-ying.huang@intel.com
Link: http://lkml.kernel.org/r/20190527082714.12151-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When swapin is performed, after getting the swap entry information from
the page table, system will swap in the swap entry, without any lock held
to prevent the swap device from being swapoff. This may cause the race
like below,
CPU 1 CPU 2
----- -----
do_swap_page
swapin_readahead
__read_swap_cache_async
swapoff swapcache_prepare
p->swap_map = NULL __swap_duplicate
p->swap_map[?] /* !!! NULL pointer access */
Because swapoff is usually done when system shutdown only, the race may
not hit many people in practice. But it is still a race need to be fixed.
To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device. If so, it will keep the swap
entry valid via preventing the swap device from being swapoff, until
put_swap_device() is called.
Because swapoff() is very rare code path, to make the normal path runs as
fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of
reference count is used to implement get/put_swap_device(). >From
get_swap_device() to put_swap_device(), RCU reader side is locked, so
synchronize_rcu() in swapoff() will wait until put_swap_device() is
called.
In addition to swap_map, cluster_info, etc. data structure in the struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch fixes the race between swap cache looking up and swapoff
too.
Races between some other swap cache usages and swapoff are fixed too via
calling synchronize_rcu() between clearing PageSwapCache() and freeing
swap cache data structure.
Another possible method to fix this is to use preempt_off() +
stop_machine() to prevent the swap device from being swapoff when its data
structure is being accessed. The overhead in hot-path of both methods is
similar. The advantages of RCU based method are,
1. stop_machine() may disturb the normal execution code path on other
CPUs.
2. File cache uses RCU to protect its radix tree. If the similar
mechanism is used for swap cache too, it is easier to share code
between them.
3. RCU is used to protect swap cache in total_swapcache_pages() and
exit_swap_address_space() already. The two mechanisms can be
merged to simplify the logic.
Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Not-nacked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking
operations") changed when mmap_sem is dropped during filemap page fault
and when returning VM_FAULT_RETRY.
Correct the comment to reflect the change.
Link: http://lkml.kernel.org/r/1556234531-108228-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We can just pass a NULL filler and do the right thing inside of
do_read_cache_page based on the NULL parameter.
Link: http://lkml.kernel.org/r/20190520055731.24538-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "fix filler_t callback type mismatches", v2.
Casting mapping->a_ops->readpage to filler_t causes an indirect call
type mismatch with Control-Flow Integrity checking. This change fixes
the mismatch in read_cache_page_gfp and read_mapping_page by adding
using a NULL filler argument as an indication to call ->readpage
directly, and by passing the right parameter callbacks in nfs and jffs2.
This patch (of 4):
Code cleanup.
Link: http://lkml.kernel.org/r/20190520055731.24538-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When debug_pagealloc is enabled, we currently allocate the page_ext
array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag. Now that
we have the page_type field in struct page, we can use that instead, as
guard pages are neither PageSlab nor mapped to userspace. This reduces
memory overhead when debug_pagealloc is enabled and there are no other
features requiring the page_ext array.
Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The page allocator checks struct pages for expected state (mapcount,
flags etc) as pages are being allocated (check_new_page()) and freed
(free_pages_check()) to provide some defense against errors in page
allocator users.
Prior commits 479f854a207c ("mm, page_alloc: defer debugging checks of
pages allocated from the PCP") and 4db7548ccbd9 ("mm, page_alloc: defer
debugging checks of freed pages until a PCP drain") this has happened
for order-0 pages as they were allocated from or freed to the per-cpu
caches (pcplists). Since those are fast paths, the checks are now
performed only when pages are moved between pcplists and global free
lists. This however lowers the chances of catching errors soon enough.
In order to increase the chances of the checks to catch errors, the
kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables
multiple other internal debug checks (VM_BUG_ON() etc), which is
suboptimal when the goal is to catch errors in mm users, not in mm code
itself.
To catch some wrong users of the page allocator we have
CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead
unless enabled at boot time. Memory corruptions when writing to freed
pages have often the same underlying errors (use-after-free, double free)
as corrupting the corresponding struct pages, so this existing debugging
functionality is a good fit to extend by also perform struct page checks
at least as often as if CONFIG_DEBUG_VM was enabled.
Specifically, after this patch, when debug_pagealloc is enabled on boot,
and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or
freed to the pcplists *in addition* to being moved between pcplists and
free lists. When both debug_pagealloc and CONFIG_DEBUG_VM are enabled,
pages are checked when being moved between pcplists and free lists *in
addition* to when allocated from or freed to the pcplists.
When debug_pagealloc is not enabled on boot, the overhead in fast paths
should be virtually none thanks to the use of static key.
Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "debug_pagealloc improvements".
I have been recently debugging some pcplist corruptions, where it would be
useful to perform struct page checks immediately as pages are allocated
from and freed to pcplists, which is now only possible by rebuilding the
kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).
To make this kind of debugging simpler in future on a distro kernel, I
have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
and extended it to perform the struct page checks more often when enabled
(Patch 2). Now it can be configured in when building a distro kernel
without extra overhead, and debugging page use after free or double free
can be enabled simply by rebooting with debug_pagealloc=on.
This patch (of 3):
CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc5743f15
("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
allow being always enabled in a distro kernel, but only perform its
expensive functionality when booted with debug_pagelloc=on. We can
further reduce the overhead when not boot-enabled (including page
allocator fast paths) using static keys. This patch introduces one for
debug_pagealloc core functionality, and another for the optional guard
page functionality (enabled by booting with debug_guardpage_minorder=X).
Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When failslab was originally written, the intention of the
"ignore-gfp-wait" flag default value ("N") was to fail GFP_ATOMIC
allocations. Those were defined as (__GFP_HIGH), and the code would test
for __GFP_WAIT (0x10u).
However, since then, __GFP_WAIT was replaced by __GFP_RECLAIM
(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM), and GFP_ATOMIC is now
defined as (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM).
This means that when the flag is false, almost no allocation ever fails
(as even GFP_ATOMIC allocations contain ___GFP_KSWAPD_RECLAIM).
Restore the original intent of the code, by ignoring calls that directly
reclaim only (__GFP_DIRECT_RECLAIM), and thus, failing GFP_ATOMIC calls
again by default.
Link: http://lkml.kernel.org/r/20190520214514.81360-1-drinkcat@chromium.org
Fixes: 71baba4b92dc1fa1 ("mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM")
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Previously totalram_pages was the global variable. Currently,
totalram_pages is the static inline function from the include/linux/mm.h
However, the function is also marked as EXPORT_SYMBOL, which is at best an
odd combination. Because there is no point for the static inline function
from a public header to be exported, this commit removes the
EXPORT_SYMBOL() marking. It will be still possible to use the function in
modules because all the symbols it depends on are exported.
Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com
Fixes: ca79b0c211af6 ("mm: convert totalram_pages and totalhigh_pages variables to atomic")
Signed-off-by: Denis Efremov <efremov@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
undo_isolate_page_range() never fails, so no need to return value.
Link: http://lkml.kernel.org/r/1562075604-8979-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
account_page_dirtied() is only used by our set_page_dirty() helpers and
should not be used anywhere else.
Link: http://lkml.kernel.org/r/20190605183702.30572-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make the success case use the same cleanup path as the failure case.
Link: http://lkml.kernel.org/r/20190523134024.GC24093@localhost.localdomain
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
follow_page_mask() is only used in gup.c, make it static.
Link: http://lkml.kernel.org/r/20190510190831.GA4061@bharath12345-Inspiron-5559
Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ksize() has been unconditionally unpoisoning the whole shadow memory
region associated with an allocation. This can lead to various undetected
bugs, for example, double-kzfree().
Specifically, kzfree() uses ksize() to determine the actual allocation
size, and subsequently zeroes the memory. Since ksize() used to just
unpoison the whole shadow memory region, no invalid free was detected.
This patch addresses this as follows:
1. Add a check in ksize(), and only then unpoison the memory region.
2. Preserve kasan_unpoison_slab() semantics by explicitly unpoisoning
the shadow memory region using the size obtained from __ksize().
Tested:
1. With SLAB allocator: a) normal boot without warnings; b) verified the
added double-kzfree() is detected.
2. With SLUB allocator: a) normal boot without warnings; b) verified the
added double-kzfree() is detected.
[elver@google.com: s/BUG_ON/WARN_ON_ONCE/, per Kees]
Link: http://lkml.kernel.org/r/20190627094445.216365-6-elver@google.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199359
Link: http://lkml.kernel.org/r/20190626142014.141844-6-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This refactors common code of ksize() between the various allocators into
slab_common.c: __ksize() is the allocator-specific implementation without
instrumentation, whereas ksize() includes the required KASAN logic.
Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This changes {,__}kasan_check_{read,write} functions to return a boolean
denoting if the access was valid or not.
[sfr@canb.auug.org.au: include types.h for "bool"]
Link: http://lkml.kernel.org/r/20190705184949.13cdd021@canb.auug.org.au
Link: http://lkml.kernel.org/r/20190626142014.141844-3-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm/kasan: Add object validation in ksize()", v3.
This patch (of 5):
This introduces __kasan_check_{read,write}. __kasan_check functions may
be used from anywhere, even compilation units that disable instrumentation
selectively.
This change eliminates the need for the __KASAN_INTERNAL definition.
[elver@google.com: v5]
Link: http://lkml.kernel.org/r/20190708170706.174189-2-elver@google.com
Link: http://lkml.kernel.org/r/20190626142014.141844-2-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This adds support for printing stack frame description on invalid stack
accesses. The frame description is embedded by the compiler, which is
parsed and then pretty-printed.
Currently, we can only print the stack frame info for accesses to the
task's own stack, but not accesses to other tasks' stacks.
Example of what it looks like:
page dumped because: kasan: bad access detected
addr ffff8880673ef98a is located in stack of task insmod/2008 at offset 106 in frame:
kasan_stack_oob+0x0/0xf5 [test_kasan]
this frame has 2 objects:
[32, 36) 'i'
[96, 106) 'stack_array'
Memory state around the buggy address:
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198435
Link: http://lkml.kernel.org/r/20190522100048.146841-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
According to POSIX, EBUSY means that the "device or resource is busy", and
this can lead to people thinking that the file
`/sys/kernel/debug/kmemleak/` is somehow locked or being used by other
process. Change this error code to a more appropriate one.
Link: http://lkml.kernel.org/r/20190612155231.19448-1-andrealmeid@collabora.com
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
in_softirq() is a wrong predicate to check if we are in a softirq
context. It also returns true if we have BH disabled, so objects are
falsely stamped with "softirq" comm. The correct predicate is
in_serving_softirq().
If user does cat from /sys/kernel/debug/kmemleak previously they would
see this, which is clearly wrong, this is system call context (see the
comm):
unreferenced object 0xffff88805bd661c0 (size 64):
comm "softirq", pid 0, jiffies 4294942959 (age 12.400s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00 ................
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
backtrace:
[<0000000007dcb30c>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
[<0000000007dcb30c>] slab_post_alloc_hook mm/slab.h:439 [inline]
[<0000000007dcb30c>] slab_alloc mm/slab.c:3326 [inline]
[<0000000007dcb30c>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
[<00000000969722b7>] kmalloc include/linux/slab.h:547 [inline]
[<00000000969722b7>] kzalloc include/linux/slab.h:742 [inline]
[<00000000969722b7>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
[<00000000969722b7>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
[<00000000a4134b5f>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
[<00000000d20248ad>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957
[<000000003d367be7>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
[<000000003c7c76af>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
[<000000000c1aeb23>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130
[<000000000157b92b>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078
[<00000000a9f3d058>] __do_sys_setsockopt net/socket.c:2089 [inline]
[<00000000a9f3d058>] __se_sys_setsockopt net/socket.c:2086 [inline]
[<00000000a9f3d058>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
[<000000001b8da885>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301
[<00000000ba770c62>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
now they will see this:
unreferenced object 0xffff88805413c800 (size 64):
comm "syz-executor.4", pid 8960, jiffies 4294994003 (age 14.350s)
hex dump (first 32 bytes):
00 7a 8a 57 80 88 ff ff e0 00 00 01 00 00 00 00 .z.W............
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
backtrace:
[<00000000c5d3be64>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
[<00000000c5d3be64>] slab_post_alloc_hook mm/slab.h:439 [inline]
[<00000000c5d3be64>] slab_alloc mm/slab.c:3326 [inline]
[<00000000c5d3be64>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
[<0000000023865be2>] kmalloc include/linux/slab.h:547 [inline]
[<0000000023865be2>] kzalloc include/linux/slab.h:742 [inline]
[<0000000023865be2>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
[<0000000023865be2>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
[<000000003029a9d4>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
[<00000000ccd0a87c>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957
[<00000000a85a3785>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
[<00000000ec13c18d>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
[<0000000052d748e3>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130
[<00000000512f1014>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078
[<00000000181758bc>] __do_sys_setsockopt net/socket.c:2089 [inline]
[<00000000181758bc>] __se_sys_setsockopt net/socket.c:2086 [inline]
[<00000000181758bc>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
[<00000000d4b73623>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301
[<00000000c1098bec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Link: http://lkml.kernel.org/r/20190517171507.96046-1-dvyukov@gmail.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently for CONFIG_SLUB, if a memcg kmem cache creation is failed and
the corresponding root kmem cache has SLAB_PANIC flag, the kernel will
be crashed. This is unnecessary as the kernel can handle the creation
failures of memcg kmem caches. Additionally CONFIG_SLAB does not
implement this behavior. So, to keep the behavior consistent between
SLAB and SLUB, removing the panic for memcg kmem cache creation
failures. The root kmem cache creation failure for SLAB_PANIC correctly
panics for both SLAB and SLUB.
Link: http://lkml.kernel.org/r/20190619232514.58994-1-shakeelb@google.com
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If ',' is not found, kmem_cache_flags() calls strlen() to find the end of
line. We can do it in a single pass using strchrnul().
Link: http://lkml.kernel.org/r/20190501053111.7950-1-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Acked-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This avoids any possible type confusion when looking up an object. For
example, if a non-slab were to be passed to kfree(), the invalid
slab_cache pointer (i.e. overlapped with some other value from the
struct page union) would be used for subsequent slab manipulations that
could lead to further memory corruption.
Since the page is already in cache, adding the PageSlab() check will
have nearly zero cost, so add a check and WARN() to virt_to_cache().
Additionally replaces an open-coded virt_to_cache(). To support the
failure mode this also updates all callers of virt_to_cache() and
cache_from_obj() to handle a NULL cache pointer return value (though
note that several already handle this case gracefully).
[dan.carpenter@oracle.com: restore IRQs in kfree()]
Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm/slab: Improved sanity checking".
This adds defenses against slab cache confusion (as seen in real-world
exploits[1]) and gracefully handles type confusions when trying to look
up slab caches from an arbitrary page. (Also is patch 3: new LKDTM
tests for these defenses as well as for the existing double-free
detection.
This patch (of 3):
When building under CONFIG_SLAB_FREELIST_HARDENING, it makes sense to
perform sanity-checking on the assumed slab cache during
kmem_cache_free() to make sure the kernel doesn't mix freelists across
slab caches and corrupt memory (as seen in the exploitation of flaws
like CVE-2018-9568[1]). Note that the prior code might WARN() but still
corrupt memory (i.e. return the assumed cache instead of the owned
cache).
There is no noticeable performance impact (changes are within noise).
Measuring parallel kernel builds, I saw the following with
CONFIG_SLAB_FREELIST_HARDENED, before and after this patch:
before:
Run times: 288.85 286.53 287.09 287.07 287.21
Min: 286.53 Max: 288.85 Mean: 287.35 Std Dev: 0.79
after:
Run times: 289.58 287.40 286.97 287.20 287.01
Min: 286.97 Max: 289.58 Mean: 287.63 Std Dev: 0.99
Delta: 0.1% which is well below the standard deviation
[1] https://github.com/ThomasKing2014/slides/raw/master/Building%20universal%20Android%20rooting%20with%20a%20type%20confusion%20vulnerability.pdf
Link: http://lkml.kernel.org/r/20190530045017.15252-2-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Following zsmalloc.c's example we call trylock_page() and unlock_page().
Also make z3fold_page_migrate() assert that newpage is passed in locked,
as per the documentation.
[akpm@linux-foundation.org: fix trylock_page return value test, per Shakeel]
Link: http://lkml.kernel.org/r/20190702005122.41036-1-henryburns@google.com
Link: http://lkml.kernel.org/r/20190702233538.52793-1-henryburns@google.com
Signed-off-by: Henry Burns <henryburns@google.com>
Suggested-by: Vitaly Wool <vitalywool@gmail.com>
Acked-by: Vitaly Wool <vitalywool@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Xidong Wang <wangxidong_97@163.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When we calculate total statistics for memcg1_stats and memcg1_events,
we use the the index 'i' in the for loop as the events index. Actually
we should use memcg1_stats[i] and memcg1_events[i] as the events index.
Link: http://lkml.kernel.org/r/1562116978-19539-1-git-send-email-laoar.shao@gmail.com
Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty").
Signed-off-by: Yafang Shao <laoar.shao@gmail.com
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When file refaults are detected and there are many inactive file pages,
the system never reclaim anonymous pages, the file pages are dropped
aggressively when there are still a lot of cold anonymous pages and
system thrashes. This issue impacts the performance of applications
with large executable, e.g. chrome.
With this patch, when file refault is detected, inactive_list_is_low()
always returns true for file pages in get_scan_count() to enable
scanning anonymous pages.
The problem can be reproduced by the following test program.
---8<---
void fallocate_file(const char *filename, off_t size)
{
struct stat st;
int fd;
if (!stat(filename, &st) && st.st_size >= size)
return;
fd = open(filename, O_WRONLY | O_CREAT, 0600);
if (fd < 0) {
perror("create file");
exit(1);
}
if (posix_fallocate(fd, 0, size)) {
perror("fallocate");
exit(1);
}
close(fd);
}
long *alloc_anon(long size)
{
long *start = malloc(size);
memset(start, 1, size);
return start;
}
long access_file(const char *filename, long size, long rounds)
{
int fd, i;
volatile char *start1, *end1, *start2;
const int page_size = getpagesize();
long sum = 0;
fd = open(filename, O_RDONLY);
if (fd == -1) {
perror("open");
exit(1);
}
/*
* Some applications, e.g. chrome, use a lot of executable file
* pages, map some of the pages with PROT_EXEC flag to simulate
* the behavior.
*/
start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
fd, 0);
if (start1 == MAP_FAILED) {
perror("mmap");
exit(1);
}
end1 = start1 + size / 2;
start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
if (start2 == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (i = 0; i < rounds; ++i) {
struct timeval before, after;
volatile char *ptr1 = start1, *ptr2 = start2;
gettimeofday(&before, NULL);
for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
sum += *ptr1 + *ptr2;
gettimeofday(&after, NULL);
printf("File access time, round %d: %f (sec)
", i,
(after.tv_sec - before.tv_sec) +
(after.tv_usec - before.tv_usec) / 1000000.0);
}
return sum;
}
int main(int argc, char *argv[])
{
const long MB = 1024 * 1024;
long anon_mb, file_mb, file_rounds;
const char filename[] = "large";
long *ret1;
long ret2;
if (argc != 4) {
printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS
");
exit(0);
}
anon_mb = atoi(argv[1]);
file_mb = atoi(argv[2]);
file_rounds = atoi(argv[3]);
fallocate_file(filename, file_mb * MB);
printf("Allocate %ld MB anonymous pages
", anon_mb);
ret1 = alloc_anon(anon_mb * MB);
printf("Access %ld MB file pages
", file_mb);
ret2 = access_file(filename, file_mb * MB, file_rounds);
printf("Print result to prevent optimization: %ld
",
*ret1 + ret2);
return 0;
}
---8<---
Running the test program on 2GB RAM VM with kernel 5.2.0-rc5, the program
fills ram with 2048 MB memory, access a 200 MB file for 10 times. Without
this patch, the file cache is dropped aggresively and every access to the
file is from disk.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.489316 (sec)
File access time, round 1: 2.581277 (sec)
File access time, round 2: 2.487624 (sec)
File access time, round 3: 2.449100 (sec)
File access time, round 4: 2.420423 (sec)
File access time, round 5: 2.343411 (sec)
File access time, round 6: 2.454833 (sec)
File access time, round 7: 2.483398 (sec)
File access time, round 8: 2.572701 (sec)
File access time, round 9: 2.493014 (sec)
With this patch, these file pages can be cached.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.475189 (sec)
File access time, round 1: 2.440777 (sec)
File access time, round 2: 2.411671 (sec)
File access time, round 3: 1.955267 (sec)
File access time, round 4: 0.029924 (sec)
File access time, round 5: 0.000808 (sec)
File access time, round 6: 0.000771 (sec)
File access time, round 7: 0.000746 (sec)
File access time, round 8: 0.000738 (sec)
File access time, round 9: 0.000747 (sec)
Checked the swap out stats during the test [1], 19006 pages swapped out
with this patch, 3418 pages swapped out without this patch. There are
more swap out, but I think it's within reasonable range when file backed
data set doesn't fit into the memory.
$ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS
PROCESSES Allocate 2000 MB anonymous pages active_anon: 1613644,
inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB)
pswpout: 7972443, pgpgin: 478615246 Access 100 MB executable file pages
Access 2100 MB regular file pages File access time, round 0: 12.165,
(sec) active_anon: 1433788, inactive_anon: 478116, active_file: 17896,
inactive_file: 24328 (kB) File access time, round 1: 11.493, (sec)
active_anon: 1430576, inactive_anon: 477144, active_file: 25440,
inactive_file: 26172 (kB) File access time, round 2: 11.455, (sec)
active_anon: 1427436, inactive_anon: 476060, active_file: 21112,
inactive_file: 28808 (kB) File access time, round 3: 11.454, (sec)
active_anon: 1420444, inactive_anon: 473632, active_file: 23216,
inactive_file: 35036 (kB) File access time, round 4: 11.479, (sec)
active_anon: 1413964, inactive_anon: 471460, active_file: 31728,
inactive_file: 32224 (kB) pswpout: 7991449 (+ 19006), pgpgin: 489924366
(+ 11309120)
With 4 processes accessing non-overlapping parts of a large file, 30316
pages swapped out with this patch, 5152 pages swapped out without this
patch. The swapout number is small comparing to pgpgin.
[1]: https://github.com/vovo/testing/blob/master/mem_thrash.c
Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com
Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")
Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Many bug fixes and cleanups, and an optimization for case-insensitive
lookups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix coverity warning on error path of filename setup
ext4: replace ktype default_attrs with default_groups
ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()
ext4: refactor initialize_dirent_tail()
ext4: rename "dirent_csum" functions to use "dirblock"
ext4: allow directory holes
jbd2: drop declaration of journal_sync_buffer()
ext4: use jbd2_inode dirty range scoping
jbd2: introduce jbd2_inode dirty range scoping
mm: add filemap_fdatawait_range_keep_errors()
ext4: remove redundant assignment to node
ext4: optimize case-insensitive lookups
ext4: make __ext4_get_inode_loc plug
ext4: clean up kerneldoc warnigns when building with W=1
ext4: only set project inherit bit for directory
ext4: enforce the immutable flag on open files
ext4: don't allow any modifications to an immutable file
jbd2: fix typo in comment of journal_submit_inode_data_buffers
jbd2: fix some print format mistakes
ext4: gracefully handle ext4_break_layouts() failure during truncate
|
|
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull copy_file_range updates from Darrick Wong:
"This fixes numerous parameter checking problems and inconsistent
behaviors in the new(ish) copy_file_range system call.
Now the system call will actually check its range parameters
correctly; refuse to copy into files for which the caller does not
have sufficient privileges; update mtime and strip setuid like file
writes are supposed to do; and allows copying up to the EOF of the
source file instead of failing the call like we used to.
Summary:
- Create a generic copy_file_range handler and make individual
filesystems responsible for calling it (i.e. no more assuming that
do_splice_direct will work or is appropriate)
- Refactor copy_file_range and remap_range parameter checking where
they are the same
- Install missing copy_file_range parameter checking(!)
- Remove suid/sgid and update mtime like any other file write
- Change the behavior so that a copy range crossing the source file's
eof will result in a short copy to the source file's eof instead of
EINVAL
- Permit filesystems to decide if they want to handle
cross-superblock copy_file_range in their local handlers"
* tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fuse: copy_file_range needs to strip setuid bits and update timestamps
vfs: allow copy_file_range to copy across devices
xfs: use file_modified() helper
vfs: introduce file_modified() helper
vfs: add missing checks to copy_file_range
vfs: remove redundant checks from generic_remap_checks()
vfs: introduce generic_file_rw_checks()
vfs: no fallback for ->copy_file_range
vfs: introduce generic_copy_file_range()
|
|
Pull Documentation updates from Jonathan Corbet:
"It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro. These create more
than the usual number of simple but annoying merge conflicts with
other trees, unfortunately. He has a lot more of these waiting on
the wings that, I think, will go to you directly later on.
- A new document on how to use merges and rebases in kernel repos,
and one on Spectre vulnerabilities.
- Various improvements to the build system, including automatic
markup of function() references because some people, for reasons I
will never understand, were of the opinion that
:c:func:``function()`` is unattractive and not fun to type.
- We now recommend using sphinx 1.7, but still support back to 1.4.
- Lots of smaller improvements, warning fixes, typo fixes, etc"
* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
docs: automarkup.py: ignore exceptions when seeking for xrefs
docs: Move binderfs to admin-guide
Disable Sphinx SmartyPants in HTML output
doc: RCU callback locks need only _bh, not necessarily _irq
docs: format kernel-parameters -- as code
Doc : doc-guide : Fix a typo
platform: x86: get rid of a non-existent document
Add the RCU docs to the core-api manual
Documentation: RCU: Add TOC tree hooks
Documentation: RCU: Rename txt files to rst
Documentation: RCU: Convert RCU UP systems to reST
Documentation: RCU: Convert RCU linked list to reST
Documentation: RCU: Convert RCU basic concepts to reST
docs: filesystems: Remove uneeded .rst extension on toctables
scripts/sphinx-pre-install: fix out-of-tree build
docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
Documentation: PGP: update for newer HW devices
Documentation: Add section about CPU vulnerabilities for Spectre
Documentation: platform: Delete x86-laptop-drivers.txt
docs: Note that :c:func: should no longer be used
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
"A source of error over the years has been that force_sig has taken a
task parameter when it is only safe to use force_sig with the current
task.
The force_sig function is built for delivering synchronous signals
such as SIGSEGV where the userspace application caused a synchronous
fault (such as a page fault) and the kernel responded with a signal.
Because the name force_sig does not make this clear, and because the
force_sig takes a task parameter the function force_sig has been
abused for sending other kinds of signals over the years. Slowly those
have been fixed when the oopses have been tracked down.
This set of changes fixes the remaining abusers of force_sig and
carefully rips out the task parameter from force_sig and friends
making this kind of error almost impossible in the future"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm |