summaryrefslogtreecommitdiffstats
path: root/mm/filemap.c
AgeCommit message (Collapse)Author
2020-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20mm: fix wake_page_function() comment typosLinus Torvalds
Sedat Dilek pointed out some silly comment typo issues. Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-17mm: allow a controlled amount of unfairness in the page lockLinus Torvalds
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made the page locking entirely fair, in that if a waiter came in while the lock was held, the lock would be transferred to the lockers strictly in order. That was intended to finally get rid of the long-reported watchdog failures that involved the page lock under extreme load, where a process could end up waiting essentially forever, as other page lockers stole the lock from under it. It also improved some benchmarks, but it ended up causing huge performance regressions on others, simply because fair lock behavior doesn't end up giving out the lock as aggressively, causing better worst-case latency, but potentially much worse average latencies and throughput. Instead of reverting that change entirely, this introduces a controlled amount of unfairness, with a sysctl knob to tune it if somebody needs to. But the default value should hopefully be good for any normal load, allowing a few rounds of lock stealing, but enforcing the strict ordering before the lock has been stolen too many times. There is also a hint from Matthieu Baerts that the fair page coloring may end up exposing an ABBA deadlock that is hidden by the usual optimistic lock stealing, and while the unfairness doesn't fix the fundamental issue (and I'm still looking at that), it avoids it in practice. The amount of unfairness can be modified by writing a new value to the 'sysctl_page_lock_unfairness' variable (default value of 5, exposed through /proc/sys/vm/page_lock_unfairness), but that is hopefully something we'd use mainly for debugging rather than being necessary for any deep system tuning. This whole issue has exposed just how critical the page lock can be, and how contended it gets under certain locks. And the main contention doesn't really seem to be anything related to IO (which was the origin of this lock), but for things like just verifying that the page file mapping is stable while faulting in the page into a page table. Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/ Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1 Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/ Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com> Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-28mm/error_inject: Fix allow_error_inject function signatures.Alexei Starovoitov
'static' and 'static noinline' function attributes make no guarantees that gcc/clang won't optimize them. The compiler may decide to inline 'static' function and in such case ALLOW_ERROR_INJECT becomes meaningless. The compiler could have inlined __add_to_page_cache_locked() in one callsite and didn't inline in another. In such case injecting errors into it would cause unpredictable behavior. It's worse with 'static noinline' which won't be inlined, but it still can be optimized. Like the compiler may decide to remove one argument or constant propagate the value depending on the callsite. To avoid such issues make sure that these functions are global noinline. Fixes: af3b854492f3 ("mm/page_alloc.c: allow error injection") Fixes: cfcbfb1382db ("mm/filemap.c: enable error injection at add_to_page_cache()") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/bpf/20200827220114.69225-2-alexei.starovoitov@gmail.com
2020-08-14mm/filemap.c: fix a data race in filemap_fault()Kirill A. Shutemov
struct file_ra_state ra.mmap_miss could be accessed concurrently during page faults as noticed by KCSAN, BUG: KCSAN: data-race in filemap_fault / filemap_map_pages write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30: filemap_fault+0x920/0xfc0 do_sync_mmap_readahead at mm/filemap.c:2384 (inlined by) filemap_fault at mm/filemap.c:2486 __xfs_filemap_fault+0x112/0x3e0 [xfs] xfs_filemap_fault+0x74/0x90 [xfs] __do_fault+0x9e/0x220 do_fault+0x4a0/0x920 __handle_mm_fault+0xc69/0xd00 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32: filemap_map_pages+0xc2e/0xd80 filemap_map_pages at mm/filemap.c:2625 do_fault+0x3da/0x920 __handle_mm_fault+0xc69/0xd00 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 Reported by Kernel Concurrency Sanitizer on: CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G W L 5.5.0-next-20200210+ #1 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 ra.mmap_miss is used to contribute the readahead decisions, a data race could be undesirable. Both the read and write is only under non-exclusive mmap_sem, two concurrent writers could even underflow the counter. Fix the underflow by writing to a local variable before committing a final store to ra.mmap_miss given a small inaccuracy of the counter should be acceptable. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Qian Cai <cai@lca.pw> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Marco Elver <elver@google.com> Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14mm: replace hpage_nr_pages with thp_nr_pagesMatthew Wilcox (Oracle)
The thp prefix is more frequently used than hpage and we should be consistent between the various functions. [akpm@linux-foundation.org: fix mm/migrate.c] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/filemap.c: delete duplicated wordRandy Dunlap
Drop the repeated word "the". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200801173822.14973-3-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: filemap: add missing FGP_ flags in kerneldoc comment for pagecache_get_pageYang Shi
FGP_{WRITE|NOFS|NOWAIT} were missed in pagecache_get_page's kerneldoc comment. Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Gang Deng <gavin.dg@linux.alibaba.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/1593031747-4249-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: filemap: clear idle flag for writesYang Shi
Since commit bbddabe2e436aa ("mm: filemap: only do access activations on reads"), mark_page_accessed() is called for reads only. But the idle flag is cleared by mark_page_accessed() so the idle flag won't get cleared if the page is write accessed only. Basically idle page tracking is used to estimate workingset size of workload, noticeable size of workingset might be missed if the idle flag is not maintained correctly. It seems good enough to just clear idle flag for write operations. Fixes: bbddabe2e436 ("mm: filemap: only do access activations on reads") Reported-by: Gang Deng <gavin.dg@linux.alibaba.com> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/1593020612-13051-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-03Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Lots of cleanups in here, hardening the code and/or making it easier to read and fixing bugs, but a core feature/change too adding support for real async buffered reads. With the latter in place, we just need buffered write async support and we're done relying on kthreads for the fast path. In detail: - Cleanup how memory accounting is done on ring setup/free (Bijan) - sq array offset calculation fixup (Dmitry) - Consistently handle blocking off O_DIRECT submission path (me) - Support proper async buffered reads, instead of relying on kthread offload for that. This uses the page waitqueue to drive retries from task_work, like we handle poll based retry. (me) - IO completion optimizations (me) - Fix race with accounting and ring fd install (me) - Support EPOLLEXCLUSIVE (Jiufei) - Get rid of the io_kiocb unionizing, made possible by shrinking other bits (Pavel) - Completion side cleanups (Pavel) - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel) - Request environment grabbing cleanups (Pavel) - File and socket read/write cleanups (Pavel) - Improve kiocb_set_rw_flags() (Pavel) - Tons of fixes and cleanups (Pavel) - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)" * tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits) io_uring: flip if handling after io_setup_async_rw fs: optimise kiocb_set_rw_flags() io_uring: don't touch 'ctx' after installing file descriptor io_uring: get rid of atomic FAA for cq_timeouts io_uring: consolidate *_check_overflow accounting io_uring: fix stalled deferred requests io_uring: fix racy overflow count reporting io_uring: deduplicate __io_complete_rw() io_uring: de-unionise io_kiocb io-wq: update hash bits io_uring: fix missing io_queue_linked_timeout() io_uring: mark ->work uninitialised after cleanup io_uring: deduplicate io_grab_files() calls io_uring: don't do opcode prep twice io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works io_uring: batch put_task_struct() tasks: add put_task_struct_many() io_uring: return locked and pinned page accounting io_uring: don't miscount pinned memory io_uring: don't open-code recv kbuf managment ...
2020-08-02list: add "list_del_init_careful()" to go with "list_empty_careful()"Linus Torvalds
That gives us ordering guarantees around the pair. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-02mm: rewrite wait_on_page_bit_common() logicLinus Torvalds
It turns out that wait_on_page_bit_common() had several problems, ranging from just unfair behavioe due to re-queueing at the end of the wait queue when re-trying, and an outright bug that could result in missed wakeups (but probably never happened in practice). This rewrites the whole logic to avoid both issues, by simply moving the logic to check (and possibly take) the bit lock into the wakeup path instead. That makes everything much more straightforward, and means that we never need to re-queue the wait entry: if we get woken up, we'll be notified through WQ_FLAG_WOKEN, and the wait queue entry will have been removed, and everything will have been done for us. Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/ Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/ Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-07fs: Add IOCB_NOIO flag for generic_file_read_iterAndreas Gruenbacher
Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it shouldn't trigger any filesystem I/O for the actual request or for readahead. This allows to do tentative reads out of the page cache as some filesystems allow, and to take the appropriate locks and retry the reads only if the requested pages are not cached. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-21mm: support async buffered reads in generic_file_buffered_read()Jens Axboe
Use the async page locking infrastructure, if IOCB_WAITQ is set in the passed in iocb. The caller must expect an -EIOCBQUEUED return value, which means that IO is started but not done yet. This is similar to how O_DIRECT signals the same operation. Once the callback is received by the caller for IO completion, the caller must retry the operation. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21mm: add support for async page lockingJens Axboe
Normally waiting for a page to become unlocked, or locking the page, requires waiting for IO to complete. Add support for lock_page_async() and wait_on_page_locked_async(), which are callback based instead. This allows a caller to get notified when a page becomes unlocked, rather than wait for it. We add a new iocb field, ki_waitq, to pass in the necessary data for this to happen. We can unionize this with ki_cookie, since that is only used for polled IO. Polled IO can never co-exist with async callbacks, as it is (by definition) polled completions. struct wait_page_key is made public, and we define struct wait_page_async as the interface between the caller and the core. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21mm: abstract out wake_page_match() from wake_page_function()Jens Axboe
No functional changes in this patch, just in preparation for allowing more callers. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21mm: allow read-ahead with IOCB_NOWAIT setJens Axboe
The read-ahead shouldn't block, so allow it to be done even if IOCB_NOWAIT is set in the kiocb. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-09mmap locking API: convert mmap_sem commentsMichel Lespinasse
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09mmap locking API: convert mmap_sem API commentsMichel Lespinasse
Convert comments that reference old mmap_sem APIs to reference corresponding new mmap locking APIs instead. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09mmap locking API: use coccinelle to convert mmap_sem rwsem call sitesMichel Lespinasse
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm/filemap: fix a typo in comment "unneccssary"->"unnecessary"Ethon Paul
There is a typo in comment, fix it. Signed-off-by: Ethon Paul <ethp@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200411065141.15936-1-ethp@qq.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: memcontrol: delete unused lrucare handlingJohannes Weiner
Swapin faults were the last event to charge pages after they had already been put on the LRU list. Now that we charge directly on swapin, the lrucare portion of the charge code is unused. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() APIJohannes Weiner
With the page->mapping requirement gone from memcg, we can charge anon and file-thp pages in one single step, right after they're allocated. This removes two out of three API calls - especially the tricky commit step that needed to happen at just the right time between when the page is "set up" and when it's "published" - somewhat vague and fluid concepts that varied by page type. All we need is a freshly allocated page and a memcg context to charge. v2: prevent double charges on pre-allocated hugepages in khugepaged [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL] Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM countersJohannes Weiner
Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM. The page is already locked in these places, so page->mem_cgroup is stable; we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure it's set up in time. Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private NR_SHMEM accounting sites. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: memcontrol: convert page cache to a new mem_cgroup_charge() APIJohannes Weiner
The try/commit/cancel protocol that memcg uses dates back to when pages used to be uncharged upon removal from the page cache, and thus couldn't be committed before the insertion had succeeded. Nowadays, pages are uncharged when they are physically freed; it doesn't matter whether the insertion was successful or not. For the page cache, the transaction dance has become unnecessary. Introduce a mem_cgroup_charge() function that simply charges a newly allocated page to a cgroup and sets up page->mem_cgroup in one single step. If the insertion fails, the caller doesn't have to do anything but free/put the page. Then switch the page cache over to this new API. Subsequent patches will also convert anon pages, but it needs a bit more prep work. Right now, memcg depends on page->mapping being already set up at the time of charging, so that it can maintain its own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set under the same pte lock under which the page is publishd, so a single charge point that can block doesn't work there just yet. The following prep patches will replace the private memcg counters with the generic vmstat counters, thus removing the page->mapping dependency, then complete the transition to the new single-point charge API and delete the old transactional scheme. v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo) v3: rebase on preceeding shmem simplification patch Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: memcontrol: drop @compound parameter from memcg charging APIJohannes Weiner
The memcg charging API carries a boolean @compound parameter that tells whether the page we're dealing with is a hugepage. mem_cgroup_commit_charge() has another boolean @lrucare that indicates whether the page needs LRU locking or not while charging. The majority of callsites know those parameters at compile time, which results in a lot of naked "false, false" argument lists. This makes for cryptic code and is a breeding ground for subtle mistakes. Thankfully, the huge page state can be inferred from the page itself and doesn't need to be passed along. This is safe because charging completes before the page is published and somebody may split it. Simplify the callsites by removing @compound, and let memcg infer the state by using hpage_nr_pages() unconditionally. That function does PageTransHuge() to identify huge pages, which also helpfully asserts that nobody passes in tail pages by accident. The following patches will introduce a new charging API, best not to carry over unnecessary weight. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: fix NUMA node file count error in replace_page_cache()Johannes Weiner
Patch series "mm: memcontrol: charge swapin pages on instantiation", v2. This patch series reworks memcg to charge swapin pages directly at swapin time, rather than at fault time, which may be much later, or not happen at all. Changes in version 2: - prevent double charges on pre-allocated hugepages in khugepaged - leave shmem swapcache when charging fails to avoid double IO (Joonsoo) - fix temporary accounting bug by switching rmap<->commit (Joonsoo) - fix double swap charge bug in cgroup1/cgroup2 code gating - simplify swapin error checking (Joonsoo) - mm: memcontrol: document the new swap control behavior (Alex) - review tags The delayed swapin charging scheme we have right now causes problems: - Alex's per-cgroup lru_lock patches rely on pages that have been isolated from the LRU to have a stable page->mem_cgroup; otherwise the lock may change underneath him. Swapcache pages are charged only after they are added to the LRU, and charging doesn't follow the LRU isolation protocol. - Joonsoo's anon workingset patches need a suitable LRU at the time the page enters the swap cache and displaces the non-resident info. But the correct LRU is only available after charging. - It's a containment hole / DoS vector. Users can trigger arbitrarily large swap readahead using MADV_WILLNEED. The memory is never charged unless somebody actually touches it. - It complicates the page->mem_cgroup stabilization rules In order to charge pages directly at swapin time, the memcg code base needs to be prepared, and several overdue cleanups become a necessity: To charge pages at swapin time, we need to always have cgroup ownership tracking of swap records. We also cannot rely on page->mapping to tell apart page types at charge time, because that's only set up during a page fault. To eliminate the page->mapping dependency, memcg needs to ditch its private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor of the generic vmstat counters and accounting sites, such as NR_FILE_PAGES, NR_ANON_MAPPED etc. To switch to generic vmstat counters, the charge sequence must be adjusted such that page->mem_cgroup is set up by the time these counters are modified. The series is structured as follows: 1. Bug fixes 2. Decoupling charging from rmap 3. Swap controller integration into memcg 4. Direct swapin charging This patch (of 19): When replacing one page with another one in the cache, we have to decrease the file count of the old page's NUMA node and increase the one of the new NUMA node, otherwise the old node leaks the count and the new node eventually underflows its counter. Fixes: 74d609585d8b ("page cache: Add and replace pages using the XArray") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/20200508183105.225460-1-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20200508183105.225460-2-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02Merge tag 'for-5.8-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Highlights: - speedup dead root detection during orphan cleanup, eg. when there are many deleted subvolumes waiting to be cleaned, the trees are now looked up in radix tree instead of a O(N^2) search - snapshot creation with inherited qgroup will mark the qgroup inconsistent, requires a rescan - send will emit file capabilities after chown, this produces a stream that does not need postprocessing to set the capabilities again - direct io ported to iomap infrastructure, cleaned up and simplified code, notably removing last use of struct buffer_head in btrfs code Core changes: - factor out backreference iteration, to be used by ordinary backreferences and relocation code - improved global block reserve utilization * better logic to serialize requests * increased maximum available for unlink * improved handling on large pages (64K) - direct io cleanups and fixes * simplify layering, where cloned bios were unnecessarily created for some cases * error handling fixes (submit, endio) * remove repair worker thread, used to avoid deadlocks during repair - refactored block group reading code, preparatory work for new type of block group storage that should improve mount time on large filesystems Cleanups: - cleaned up (and slightly sped up) set/get helpers for metadata data structure members - root bit REF_COWS got renamed to SHAREABLE to reflect the that the blocks of the tree get shared either among subvolumes or with the relocation trees Fixes: - when subvolume deletion fails due to ENOSPC, the filesystem is not turned read-only - device scan deals with devices from other filesystems that changed ownership due to overwrite (mkfs) - fix a race between scrub and block group removal/allocation - fix long standing bug of a runaway balance operation, printing the same line to the syslog, caused by a stale status bit on a reloc tree that prevented progress - fix corrupt log due to concurrent fsync of inodes with shared extents - fix space underflow for NODATACOW and buffered writes when it for some reason needs to fallback to COW mode" * tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits) btrfs: fix space_info bytes_may_use underflow during space cache writeout btrfs: fix space_info bytes_may_use underflow after nocow buffered write btrfs: fix wrong file range cleanup after an error filling dealloc range btrfs: remove redundant local variable in read_block_for_search btrfs: open code key_search btrfs: split btrfs_direct_IO to read and write part btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK fs: remove dio_end_io() btrfs: switch to iomap_dio_rw() for dio iomap: remove lockdep_assert_held() iomap: add a filesystem hook for direct I/O bio submission fs: export generic_file_buffered_read() btrfs: turn space cache writeout failure messages into debug messages btrfs: include error on messages about failure to write space/inode caches btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks() btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums btrfs: make checksum item extension more efficient btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents btrfs: unexport btrfs_compress_set_level() btrfs: simplify iget helpers ...
2020-06-02mm/filemap.c: remove misleading commentMatthew Wilcox (Oracle)
We no longer return 0 here and the comment doesn't tell us anything that we don't already know (SIGBUS is a pretty good indicator that things didn't work out). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200529123243.20640-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-25fs: export generic_file_buffered_read()Goldwyn Rodrigues
Export generic_file_buffered_read() to be used to supplement incomplete direct reads. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-07mm: huge tmpfs: try to split_huge_page() when punching holeHugh Dickins
Yang Shi writes: Currently, when truncating a shmem file, if the range is partly in a THP (start or end is in the middle of THP), the pages actually will just get cleared rather than being freed, unless the range covers the whole THP. Even though all the subpages are truncated (randomly or sequentially), the THP may still be kept in page cache. This might be fine for some usecases which prefer preserving THP, but balloon inflation is handled in base page size. So when using shmem THP as memory backend, QEMU inflation actually doesn't work as expected since it doesn't free memory. But the inflation usecase really needs to get the memory freed. (Anonymous THP will also not get freed right away, but will be freed eventually when all subpages are unmapped: whereas shmem THP still stays in page cache.) Split THP right away when doing partial hole punch, and if split fails just clear the page so that read of the punched area will return zeroes. Hugh Dickins adds: Our earlier "team of pages" huge tmpfs implementation worked in the way that Yang Shi proposes; and we have been using this patch to continue to split the huge page when hole-punched or truncated, since converting over to the compound page implementation. Although huge tmpfs gives out huge pages when available, if the user specifically asks to truncate or punch a hole (perhaps to free memory, perhaps to reduce the memcg charge), then the filesystem should do so as best it can, splitting the huge page. That is not always possible: any additional reference to the huge page prevents split_huge_page() from succeeding, so the result can be flaky. But in practice it works successfully enough that we've not seen any problem from that. Add shmem_punch_compound() to encapsulate the decision of when a split is needed, and doing the split if so. Using this simplifies the flow in shmem_undo_range(); and the first (trylock) pass does not need to do any page clearing on failure, because the second pass will either succeed or do that clearing. Following the example of zero_user_segment() when clearing a partial page, add flush_dcache_page() and set_page_dirty() when clearing a hole - though I'm not certain that either is needed. But: split_huge_page() would be sure to fail if shmem_undo_range()'s pagevec holds further references to the huge page. The easiest way to fix that is for find_get_entries() to return early, as soon as it has put one compound head or tail into the pagevec. At first this felt like a hack; but on examination, this convention better suits all its callers - or will do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping() and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec speedup by checking for compound pages there. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm: allow VM_FAULT_RETRY for multiple timesPeter Xu
The idea comes from a discussion between Linus and Andrea [1]. Before this patch we only allow a page fault to retry once. We achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing handle_mm_fault() the second time. This was majorly used to avoid unexpected starvation of the system by looping over forever to handle the page fault on a single page. However that should hardly happen, and after all for each code path to return a VM_FAULT_RETRY we'll first wait for a condition (during which time we should possibly yield the cpu) to happen before VM_FAULT_RETRY is really returned. This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY. It means that the page fault handler now can retry the page fault for multiple times if necessary without the need to generate another page fault event. Meanwhile we still keep the FAULT_FLAG_TRIED flag so page fault handler can still identify whether a page fault is the first attempt or not. Then we'll have these combinations of fault flags (only considering ALLOW_RETRY flag and TRIED flag): - ALLOW_RETRY and !TRIED: this means the page fault allows to retry, and this is the first try - ALLOW_RETRY and TRIED: this means the page fault allows to retry, and this is not the first try - !ALLOW_RETRY and !TRIED: this means the page fault does not allow to retry at all - !ALLOW_RETRY and TRIED: this is forbidden and should never be used In existing code we have multiple places that has taken special care of the first condition above by checking against (fault_flags & FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect the first retry of a page fault by checking against both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now even the 2nd try will have the ALLOW_RETRY set, then use that helper in all existing special paths. One example is in __lock_page_or_retry(), now we'll drop the mmap_sem only in the first attempt of page fault and we'll keep it in follow up retries, so old locking behavior will be retained. This will be a nice enhancement for current code [2] at the same time a supporting material for the future userfaultfd-writeprotect work, since in that work there will always be an explicit userfault writeprotect retry for protected pages, and if that cannot resolve the page fault (e.g., when userfaultfd-writeprotect is used in conjunction with swapped pages) then we'll possibly need a 3rd retry of the page fault. It might also benefit other potential users who will have similar requirement like userfault write-protection. GUP code is not touched yet and will be covered in follow up patch. Please read the thread below for more information. [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/ [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/filemap.c: rewrite pagecache_get_page documentationMatthew Wilcox (Oracle)
- These were never called PCG flags; they've been called FGP flags since their introduction in 2014. - The FGP_FOR_MMAP flag was misleadingly documented as if it was an alternative to FGP_CREAT instead of an option to it. - Rename the 'offset' parameter to 'index'. - Capitalisation, formatting, rewording. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-9-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/filemap.c: unexport find_get_entryMatthew Wilcox (Oracle)
No in-tree users (proc, madvise, memcg, mincore) can be built as a module. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-8-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/filemap.c: use vm_fault error code directlyMatthew Wilcox (Oracle)
Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/filemap.c: remove unused argument from shrink_readahead_size_eio()Souptick Joarder
The first argument of shrink_readahead_size_eio() is not used. Hence remove it from the function definition and from all the callers. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1583868093-24342-1-git-send-email-jrdr.linux@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/filemap.c: clear page error before actual readXianting Tian
Mount failure issue happens under the scenario: Application forked dozens of threads to mount the same number of cramfs images separately in docker, but several mounts failed with high probability. Mount failed due to the checking result of the page(read from the superblock of loop dev) is not uptodate after wait_on_page_locked(page) returned in function cramfs_read: wait_on_page_locked(page); if (!PageUptodate(page)) { ... } The reason of the checking result of the page not uptodate: systemd-udevd read the loopX dev before mount, because the status of loopX is Lo_unbound at this time, so loop_make_request directly trigger the calling of io_end handler end_buffer_async_read, which called SetPageError(page). So It caused the page can't be set to uptodate in function end_buffer_async_read: if(page_uptodate && !PageError(page)) { SetPageUptodate(page); } Then mount operation is performed, it used the same page which is just accessed by systemd-udevd above, Because this page is not uptodate, it will launch a actual read via submit_bh, then wait on this page by calling wait_on_page_locked(page). When the I/O of the page done, io_end handler end_buffer_async_read is called, because no one cleared the page error(during the whole read path of mount), which is caused by systemd-udevd reading, so this page is still in "PageError" status, which can't be set to uptodate in function end_buffer_async_read, then caused mount failure. But sometimes mount succeed even through systemd-udeved read loopX dev just before, The reason is systemd-udevd launched other loopX read just between step 3.1 and 3.2, the steps as below: 1, loopX dev default status is Lo_unbound; 2, systemd-udved read loopX dev (page is set to PageError); 3, mount operation 1) set loopX status to Lo_bound; ==>systemd-udevd read loopX dev<== 2) read loopX dev(page has no error) 3) mount succeed As the loopX dev status is set to Lo_bound after step 3.1, so the other loopX dev read by systemd-udevd will go through the whole I/O stack, part of the call trace as below: SYS_read vfs_read do_sync_read blkdev_aio_read generic_file_aio_read do_generic_file_read: ClearPageError(page); mapping->a_ops->readpage(filp, page); here, mapping->a_ops->readpage() is blkdev_readpage. In latest kernel, some function name changed, the call trace as below: blkdev_read_iter generic_file_read_iter generic_file_buffered_read: /* * A previous I/O error may have been due to temporary * failures, eg. mutipath errors. * Pg_error will be set again if readpage fails. */ ClearPageError(page); /* Start the actual read. The read will unlock the page*/ error=mapping->a_ops->readpage(flip, page); We can see ClearPageError(page) is called before the actual read, then the read in step 3.2 succeed. This patch is to add the calling of ClearPageError just before the actual read of read path of cramfs mount. Without the patch, the call trace as below when performing cramfs mount: do_mount cramfs_read cramfs_blkdev_read read_cache_page do_read_cache_page: filler(data, page); or mapping->a_ops->readpage(data, page); With the patch, the call trace as below when performing mount: do_mount cramfs_read cramfs_blkdev_read read_cache_page: do_read_cache_page: ClearPageError(page); <== new add filler(data, page); or mapping->a_ops->readpage(data, page); With the patch, mount operation trigger the calling of ClearPageError(page) before the actual read, the page has no error if no additional page error happen when I/O done. Signed-off-by: Xianting Tian <xianting_tian@126.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: <yubin@h3c.com> Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/filemap.c: don't bother dropping mmap_sem for zero size readaheadJan Kara
When handling a page fault, we drop mmap_sem to sta