summaryrefslogtreecommitdiffstats
path: root/arch/powerpc/mm
AgeCommit message (Collapse)Author
2017-05-17powerpc/mm: Fix crash in page table dump with huge pagesMichael Ellerman
The page table dump code doesn't know about huge pages, so currently it crashes (or walks random memory, usually leading to a crash), if it finds a huge page. On Book3S we only see huge pages in the Linux page tables when we're using the P9 Radix MMU. Teaching the code to properly handle huge pages is a bit more involved, so for now just prevent the crash. Cc: stable@vger.kernel.org # v4.10+ Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-08scripts/spelling.txt: add regsiter -> register spelling mistakeStephen Boyd
This typo is quite common. Fix it and add it to the spelling file so that checkpatch catches it earlier. Link: http://lkml.kernel.org/r/20170317011131.6881-2-sboyd@codeaurora.org Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-05Merge tag 'powerpc-4.12-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Larger virtual address space on 64-bit server CPUs. By default we use a 128TB virtual address space, but a process can request access to the full 512TB by passing a hint to mmap(). - Support for the new Power9 "XIVE" interrupt controller. - TLB flushing optimisations for the radix MMU on Power9. - Support for CAPI cards on Power9, using the "Coherent Accelerator Interface Architecture 2.0". - The ability to configure the mmap randomisation limits at build and runtime. - Several small fixes and cleanups to the kprobes code, as well as support for KPROBES_ON_FTRACE. - Major improvements to handling of system reset interrupts, correctly treating them as NMIs, giving them a dedicated stack and using a new hypervisor call to trigger them, all of which should aid debugging and robustness. - Many fixes and other minor enhancements. Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi" * tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits) powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it powerpc/powernv: Fix TCE kill on NVLink2 powerpc/mm/radix: Drop support for CPUs without lockless tlbie powerpc/book3s/mce: Move add_taint() later in virtual mode powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body powerpc/smp: Document irq enable/disable after migrating IRQs powerpc/mpc52xx: Don't select user-visible RTAS_PROC powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset() powerpc/eeh: Clean up and document event handling functions powerpc/eeh: Avoid use after free in eeh_handle_special_event() cxl: Mask slice error interrupts after first occurrence cxl: Route eeh events to all drivers in cxl_pci_error_detected() cxl: Force context lock during EEH flow powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST powerpc/xmon: Teach xmon oops about radix vectors powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids powerpc/pseries: Enable VFIO powerpc/powernv: Fix iommu table size calculation hook for small tables powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc powerpc: Add arch/powerpc/tools directory ...
2017-05-03powerpc/mm/radix: Drop support for CPUs without lockless tlbieMichael Ellerman
Currently the radix TLB code includes support for CPUs that do *not* have MMU_FTR_LOCKLESS_TLBIE. On those CPUs we are required to take a global spinlock before issuing a tlbie. Radix can only be built for 64-bit Book3s CPUs, and of those, only POWER4, 970, Cell and PA6T do not have MMU_FTR_LOCKLESS_TLBIE. Although it's possible to build a kernel with Radix support that can also boot on those CPUs, we happen to know that in reality none of those CPUs support the Radix MMU, so the code can never actually run on those CPUs. So remove the native_tlbie_lock in the Radix TLB code. Note that there is another lock of the same name in the hash code, which is unaffected by this patch. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28Merge branch 'topic/ppc-kvm' into nextMichael Ellerman
Merge the topic branch we were sharing with kvm-ppc, Paul has also merged it.
2017-04-27powerpc/mm: Rename table dump file nameChristophe Leroy
Page table dump debugfs file is named 'kernel_page_tables' on all other architectures implementing it, while is is named 'kernel_pagetables' on powerpc. This patch renames it. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm: On PPC32, display 32 bits addresses in page table dumpChristophe Leroy
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm: Fix missing page attributes in page table dumpChristophe Leroy
On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used. There is also _PAGE_SHARED that is missing. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm: Fix page table dump build on PPC32Christophe Leroy
On PPC32 (eg. mpc885_ads_defconfig), page table dump compilation fails as follows. This is because the memory layout is slightly different on PPC32. This patch adapts it. arch/powerpc/mm/dump_linuxpagetables.c: In function 'walk_pagetables': arch/powerpc/mm/dump_linuxpagetables.c:369:10: error: 'KERN_VIRT_START' undeclared (first use in this function) ... Fixes: 8eb07b187000d ("powerpc/mm: Dump linux pagetables") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm/radix: Optimise tlbiel flush all caseAneesh Kumar K.V
_tlbiel_pid() is called with a ric (Radix Invalidation Control) argument of either RIC_FLUSH_TLB or RIC_FLUSH_ALL. RIC_FLUSH_ALL says to invalidate the entire TLB and the Page Walk Cache (PWC). To flush the whole TLB, we have to iterate over each set (congruence class) of the TLB. Currently we do that and pass RIC_FLUSH_ALL each time. That is not incorrect but it means we flush the PWC 128 times, when once would suffice. Fix it by doing the first flush with the ric value we're passed, and then if it was RIC_FLUSH_ALL, we downgrade it to RIC_FLUSH_TLB, because we know we have just flushed the PWC and don't need to do it again. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Split out of combined patch, tweak logic, rewrite change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm/radix: Optimise Page Walk Cache flushAneesh Kumar K.V
Currently we implement flushing of the page walk cache (PWC) by calling _tlbiel_pid() with a RIC (Radix Invalidation Control) value of 1 which says to only flush the PWC. But _tlbiel_pid() loops over each set (congruence class) of the TLB, which is not necessary when we're just flushing the PWC. In fact the set argument is ignored for a PWC flush, so essentially we're just flushing the PWC 127 extra times for no benefit. Fix it by adding tlbiel_pwc() which just does a single flush of the PWC. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Split out of combined patch, drop _ in name, rewrite change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-26powerpc/mm: Fix possible out-of-bounds shift in arch_mmap_rnd()Michael Ellerman
The recent patch to add runtime configuration of the ASLR limits added a bug in arch_mmap_rnd() where we may shift an integer (32-bits) by up to 33 bits, leading to undefined behaviour. In practice it exhibits as every process seg faulting instantly, presumably because the rnd value hasn't been restricited by the modulus at all. We didn't notice because it only happens under certain kernel configurations and if the number of bits is actually set to a large value. Fix it by switching to unsigned long. Fixes: 9fea59bd7ca5 ("powerpc/mm: Add support for runtime configuration of ASLR limits") Reported-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-21powerpc/mm: Add support for runtime configuration of ASLR limitsMichael Ellerman
Add powerpc support for mmap_rnd_bits and mmap_rnd_compat_bits, which are two sysctls that allow a user to configure the number of bits of randomness used for ASLR. Because of the way the Kconfig for ARCH_MMAP_RND_BITS is defined, we have to construct at least the MIN value in Kconfig, vs in a header which would be more natural. Given that we just go ahead and do it all in Kconfig. At least according to the code (the documentation makes no mention of it), the value is defined as the number of bits of randomisation *of the page*, not the address. This makes some sense, with larger page sizes more of the low bits are forced to zero, which would reduce the randomisation if we didn't take the PAGE_SIZE into account. However it does mean the min/max values have to change depending on the PAGE_SIZE in order to actually limit the amount of address space consumed by the randomisation. The result of that is that we have to define the default values based on both 32-bit vs 64-bit, but also the configured PAGE_SIZE. Furthermore now that we have 128TB address space support on Book3S, we also have to take that into account. Finally we can wire up the value in arch_mmap_rnd(). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com> Tested-by: Bhupesh Sharma <bhsharma@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2017-04-19powerpc/iommu: Do not call PageTransHuge() on tail pagesAlexey Kardashevskiy
The CMA pages migration code does not support compound pages at the moment so it performs few tests before proceeding to actual page migration. One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as it is designed to be called on head pages only. Since we also test for PageCompound(), and it contains PageTail() and PageHead(), we can simplify the check by leaving just PageCompound() and therefore avoid possible VM_BUG_ON_PAGE. Fixes: 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-19powerpc/mmap: Any hint > 128TB searches the full VA spaceAneesh Kumar K.V
As part of the new large address space support, processes start out life with a 128TB virtual address space. However when calling mmap() a process can pass a hint address, and if that hint is > 128TB the kernel will use the full 512TB address space to try and satisfy the mmap() request. Currently we have a check that the hint is > 128TB and < 512TB (TASK_SIZE), which was added as an optimisation to avoid updating addr_limit unnecessarily and also to avoid calling slice_flush_segments() on all CPUs more than necessary. However this has the user-visible side effect that an mmap() hint above 512TB does not search the full address space unless a preceding mmap() used a hint value > 128TB && < 512TB. So fix it to treat any hint above 128TB as a hint to search the full address space, instead of checking the hint against TASK_SIZE, we instead check if the addr_limit is already == TASK_SIZE. This also brings the ABI in-line with what is proposed on x86. ie, that a hint address above 128TB up to and including (2^64)-1 is an indication to search the full address space. Fixes: f4ea6dcb08ea2c (powerpc/mm: Enable mappings above 128TB) Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-19powerpc/mm/radix: Use mm->task_size for boundary checking instead of addr_limitAneesh Kumar K.V
We don't init addr_limit correctly for 32 bit applications. So default to using mm->task_size for boundary condition checking. We use addr_limit to only control free space search. This makes sure that we do the right thing with 32 bit applications. We should consolidate the usage of TASK_SIZE/mm->task_size and mm->context.addr_limit later. This partially reverts commit fbfef9027c2a7ad (powerpc/mm: Switch some TASK_SIZE checks to use mm_context addr_limit). Fixes: fbfef9027c2a ("powerpc/mm: Switch some TASK_SIZE checks to use mm_context addr_limit") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-13powerpc/mm/hash: Don't open code VMALLOC_INDEXAneesh Kumar K.V
We have a #define for it, so use it. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-12powerpc/mm: Fix hash table dump when memory is not contiguousRashmica Gupta
The current behaviour of the hash table dump assumes that memory is contiguous and iterates from the start of memory to (start + size of memory). When memory isn't physically contiguous, this doesn't work. If memory exists at 0-5 GB and 6-10 GB then the current approach will check if entries exist in the hash table from 0GB to 9GB. This patch changes the behaviour to iterate over any holes up to the end of memory. Fixes: 1515ab932156 ("powerpc/mm: Dump hash table") Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-12powerpc/mm: Add physical address to Linux page table dumpOliver O'Halloran
The current page table dumper scans the Linux page tables and coalesces mappings with adjacent virtual addresses and similar PTE flags. This behaviour is somewhat broken when you consider the IOREMAP space where entirely unrelated mappings will appear to be virtually contiguous. This patch modifies the range coalescing so that only ranges that are both physically and virtually contiguous are combined. This patch also adds to the dump output the physical address at the start of each range. Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> [mpe: Print the physicall address with 0x like the other addresses] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-12powerpc/mm: Fix missing _PAGE_NON_IDEMPOTENT in pgtable dumpOliver O'Halloran
On Book3s we have two PTE flags used to mark cache-inhibited mappings: _PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page table dumper only looks at the generic _PAGE_NO_CACHE which is defined to be _PAGE_TOLERANT. This patch modifies the dumper so both flags are shown in the dump. Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11powerpc/mm: Remove reduntant initmem information from logAnshuman Khandual
Generic core VM already prints these information in the log buffer, hence there is no need for a second print. This just removes the second print from arch powerpc NUMA init path. Before the patch: $ dmesg | grep "Initmem" numa: Initmem setup node 0 [mem 0x00000000-0xffffffff] numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff] numa: Initmem setup node 2 [mem 0x200000000-0x2ffffffff] numa: Initmem setup node 3 [mem 0x300000000-0x3ffffffff] numa: Initmem setup node 4 [mem 0x400000000-0x4ffffffff] numa: Initmem setup node 5 [mem 0x500000000-0x5ffffffff] numa: Initmem setup node 6 [mem 0x600000000-0x6ffffffff] numa: Initmem setup node 7 [mem 0x700000000-0x7ffffffff] Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff] Initmem setup node 1 [mem 0x0000000100000000-0x00000001ffffffff] Initmem setup node 2 [mem 0x0000000200000000-0x00000002ffffffff] Initmem setup node 3 [mem 0x0000000300000000-0x00000003ffffffff] Initmem setup node 4 [mem 0x0000000400000000-0x00000004ffffffff] Initmem setup node 5 [mem 0x0000000500000000-0x00000005ffffffff] Initmem setup node 6 [mem 0x0000000600000000-0x00000006ffffffff] Initmem setup node 7 [mem 0x0000000700000000-0x00000007ffffffff] After the patch just the latter set is printed. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit()Michael Ellerman
setup_initial_memory_limit() is called from early_init_devtree(), which runs prior to feature patching. If the kernel is built with CONFIG_JUMP_LABEL=y and CONFIG_JUMP_LABEL_FEATURE_CHECKS=y then we will potentially get the wrong value. If we also have CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG=y we get a warning and backtrace: Warning! mmu_has_feature() used prior to jump label init! CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc4-gccN-next-20170331-g6af2434 #1 Call Trace: [c000000000fc3d50] [c000000000a26c30] .dump_stack+0xa8/0xe8 (unreliable) [c000000000fc3de0] [c00000000002e6b8] .setup_initial_memory_limit+0xa4/0x104 [c000000000fc3e60] [c000000000d5c23c] .early_init_devtree+0xd0/0x2f8 [c000000000fc3f00] [c000000000d5d3b0] .early_setup+0x90/0x11c [c000000000fc3f90] [c000000000000520] start_here_multiplatform+0x68/0x80 Fix it by using early_mmu_has_feature(). Fixes: c12e6f24d413 ("powerpc: Add option to use jump label for mmu_has_feature()") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11powerpc: Create asm/debugfs.h and move powerpc_debugfs_root thereMichael Ellerman
powerpc_debugfs_root is the dentry representing the root of the "powerpc" directory tree in debugfs. Currently it sits in asm/debug.h, a long with some other things that have "debug" in the name, but are otherwise unrelated. Pull it out into a separate header, which also includes linux/debugfs.h, and convert all the users to include debugfs.h instead of debug.h. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11powerpc/mm/radix: Remove unnecessary ptesyncAneesh Kumar K.V
For a tlbiel with pid, we need to issue tlbiel with set number encoded. We don't need to do ptesync for each of those. Instead we need one for the entire tlbiel pid operation. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11powerpc/mm/radix: Don't do page walk cache flush when doing full mm flushAneesh Kumar K.V
For fullmm tlb flush, we do a flush with RIC_FLUSH_ALL which will invalidate all related caches (radix__tlb_flush()). Hence the pwc flush is not needed. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-05powerpc/mm: Add missing global TLB invalidate if cxl is activeFrederic Barrat
Commit 4c6d9acce1f4 ("powerpc/mm: Add hooks for cxl") converted local TLB invalidates to global if the cxl driver is active. This is necessary because the CAPP snoops invalidations to forward them to the PSL on the cxl adapter. However one path was forgotten. native_flush_hash_range() still does local TLB invalidates, as found out the hard way recently. This patch fixes it by following the same logic as previously: if the cxl driver is active, the local TLB invalidates are 'upgraded' to global. Fixes: 4c6d9acce1f4 ("powerpc/mm: Add hooks for cxl") Cc: stable@vger.kernel.org # v3.18+ Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-04powerpc/powernv: Introduce address translation services for Nvlink2Alistair Popple
Nvlink2 supports address translation services (ATS) allowing devices to request address translations from an mmu known as the nest MMU which is setup to walk the CPU page tables. To access this functionality certain firmware calls are required to setup and manage hardware context tables in the nvlink processing unit (NPU). The NPU also manages forwarding of TLB invalidates (known as address translation shootdowns/ATSDs) to attached devices. This patch exports several methods to allow device drivers to register a process id (PASID/PID) in the hardware tables and to receive notification of when a device should stop issuing address translation requests (ATRs). It also adds a fault handler to allow device drivers to demand fault pages in. Signed-off-by: Alistair Popple <alistair@popple.id.au> [mpe: Fix up comment formatting, use flush_tlb_mm()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-03powerpc/mm: Remove stale comment about the DART holeOliver O'Halloran
The code to fix the problem it describes was removed in commit c40785ad305b ("powerpc/dart: Use a cachable DART"), and it uses the stupid comment style. Away it goooooooooooooes! Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-03powerpc: Avoid taking a data miss on every userspace instruction missAnton Blanchard
Early on in do_page_fault() we call store_updates_sp(), regardless of the type of exception. For an instruction miss this doesn't make sense, because we only use this information to detect if a data miss is the result of a stack expansion instruction or not. Worse still, it results in a data miss within every userspace instruction miss handler, because we try and load the very instruction we are about to install a pte for! A simple exec microbenchmark runs 6% faster on POWER8 with this fix: #include <stdlib.h> #include <stdio.h> #include <unistd.h> int main(int argc, char *argv[]) { unsigned long left = atol(argv[1]); char leftstr[16]; if (left-- == 0) return 0; sprintf(leftstr, "%ld", left); execlp(argv[0], argv[0], leftstr, NULL); perror("exec failed\n"); return 0; } Pass the number of iterations on the command line (eg 10000) and time how long it takes to execute. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/mm: Enable mappings above 128TBAneesh Kumar K.V
Not all user space application is ready to handle wide addresses. It's known that at least some JIT compilers use higher bits in pointers to encode their information. It collides with valid pointers with 512TB addresses and leads to crashes. To mitigate this, we are not going to allocate virtual address space above 128TB by default. But userspace can ask for allocation from full address space by specifying hint address (with or without MAP_FIXED) above 128TB. If hint address set above 128TB, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in *full* address space, rather than from 128TB window. This approach helps to easily make application's memory allocator aware about large address space without manually tracking allocated virtual address space. This is going to be a per mmap decision. ie, we can have some mmaps with larger addresses and other that do not. A sample memory layout looks like: 10000000-10010000 r-xp 00000000 fc:00 9057045 /home/max_addr_512TB 10010000-10020000 r--p 00000000 fc:00 9057045 /home/max_addr_512TB 10020000-10030000 rw-p 00010000 fc:00 9057045 /home/max_addr_512TB 10029630000-10029660000 rw-p 00000000 00:00 0 [heap] 7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0 7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83690000-7fff836a0000 rw-p 00000000 00:00 0 7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0 [vdso] 7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0 [stack] 1000000000000-1000000010000 rw-p 00000000 00:00 0 1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/mm: Switch some TASK_SIZE checks to use mm_context addr_limitAneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/pseries: Skip using reserved virtual address rangeAneesh Kumar K.V
Now that we use all the available virtual address range, we need to make sure we don't generate VSID such that it overlaps with the reserved vsid range. Reserved vsid range include the virtual address range used by the adjunct partition and also the VRMA virtual segment. We find the context value that can result in generating such a VSID and reserve it early in boot. We don't look at the adjunct range, because for now we disable the adjunct usage in a Linux LPAR via CAS interface. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/mm/hash: Store addr_limit in PACAAneesh Kumar K.V
We optmize the slice page size array copy to paca by copying only the range based on addr_limit. This will require us to not look at page size array beyond addr_limit in PACA on slb fault. To enable that copy task size to paca which will be used during slb fault. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rename from task_size to addr_limit, consolidate #ifdefs] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/mm: Add addr_limit to mm_context and use it to derive max slice indexAneesh Kumar K.V
In the followup patch, we will increase the slice array size to handle 512TB range, but will limit the max addr to 128TB. Avoid doing unnecessary computation and avoid doing slice mask related operation above address limit. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hash: Support 68 bit VAAneesh Kumar K.V
Inorder to support large effective address range (512TB), we want to increase the virtual address bits to 68. But we do have platforms like p4 and p5 that can only do 65 bit VA. We support those platforms by limiting context bits on them to 16. The protovsid -> vsid conversion is verified to work with both 65 and 68 bit va values. I also documented the restrictions in a table format as part of code comments. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hash: Use context ids 1-4 for the kernelAneesh Kumar K.V
Currently we use the top 4 context ids (0x7fffc-0x7ffff) for the kernel. Kernel VSIDs are built using these top context values and effective the segement ID. In subsequent patches we want to increase the max effective address to 512TB. We will achieve that by increasing the effective segment IDs there by increasing virtual address range. We will be switching to a 68bit virtual address in the following patch. But platforms like Power4 and Power5 only support a 65 bit virtual address. We will handle that by limiting the context bits to 16 instead of 19 on those platforms. That means the max context id will have a different value on different platforms. So that we don't have to deal with the kernel context ids changing between different platforms, move the kernel context ids down to use context ids 1-4. We can't use segment 0 of context-id 0, because that maps to VSID 0, which we want to keep as invalid, so we avoid context-id 0 entirely. Similarly we can't use the last segment of the maximum context, so we avoid it too. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Switch from 0-3 to 1-4 so VSID=0 remains invalid] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm: Split radix vs hash mm context initialisationMichael Ellerman
Complete the split of the radix vs hash mm context initialisation. This is mostly code movement, with the exception that we now limit the context allocation to PRTB_ENTRIES - 1 on radix. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hash: Pull hash constants into hash__alloc_context_id()Michael Ellerman
The min and max context id values used in alloc_context_id() are currently the right values for use on hash, and happen to also be safe for use on radix. But we need to change that in a subsequent patch, so make the min/max ids parameters and pull the hash values into hsah__alloc_context_id(). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hash: Abstract context id allocation for KVMMichael Ellerman
KVM wants to be able to allocate an MMU context id, which it does currently by calling __init_new_context(). We're about to rework that code, so provide a wrapper for KVM so it can not worry about the details. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/slice: Update slice mask printing to use bitmap printing.Aneesh Kumar K.V
We now get output like below which is much better. [ 0.935306] good_mask low_slice: 0-15 [ 0.935360] good_mask high_slice: 0-511 Compared to [ 0.953414] good_mask:1111111111111111 - 1111111111111......... I also fixed an error with slice_dbg printing. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/slice: Move slice_mask struct definition to slice.cAneesh Kumar K.V
This structure definition need not be in a header since this is used only by slice.c file. So move it to slice.c. This also allow us to use SLICE_NUM_HIGH instead of 64. I also switch the low_slices type to u64 from u16. This doesn't have an impact on size of struct due to padding added with u16 type. This helps in using bitmap printing function for printing slice mask. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm: Remove checks that TASK_SIZE_USER64 is too smallAneesh Kumar K.V
Remove the checks that TASK_SIZE_USER64 is smaller than H_PGTABLE_RANGE and USER_VSID_RANGE. In a following patch we will deliberately add support for a TASK_SIZE smaller than both ranges, so this will no longer be an error condition. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Keep the check in pgtable_64.c that we don't exceed USER_VSID_RANGE] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm: Move copy_mm_to_paca to paca.cAneesh Kumar K.V
We also update the function arg to struct mm_struct. Move this so that function finds the definition of struct mm_struct. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/slice: Update the function prototypeAneesh Kumar K.V
This avoid copying the slice_mask struct as function return value Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/slice: Convert slice_mask high slice to a bitmapAneesh Kumar K.V
In followup patch we want to increase the va range which will result in us requiring high_slices to have more than 64 bits. To enable this convert high_slices to bitmap. We keep the number bits same in this patch and later change that to higher value Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Fold in fix to use bitmap_empty()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm: Move hash specific pte bits to be top bits of RPNAneesh Kumar K.V
We don't support the full 57 bits of physical address and hence can overload the top bits of RPN as hash specific pte bits. Add a BUILD_BUG_ON() to enforce the relationship between H_PAGE_F_SECOND and H_PAGE_F_GIX. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Paul Mackerras <paulus@ozlabs.org> [mpe: Move the BUILD_BUG_ON() into hash_utils_64.c and comment it] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layoutAneesh Kumar K.V
Without this if firmware reports 1MB page size support we will crash trying to use 1MB as hugetlb page size. echo 300 > /sys/kernel/mm/hugepages/hugepages-1024kB/nr_hugepages kernel BUG at ./arch/powerpc/include/asm/hugetlb.h:19! ..... .... [c0000000e2c27b30] c00000000029dae8 .hugetlb_fault+0x638/0xda0 [c0000000e2c27c30] c00000000026fb64 .handle_mm_fault+0x844/0x1d70 [c0000000e2c27d70] c00000000004805c .do_page_fault+0x3dc/0x7c0 [c0000000e2c27e30] c00000000000ac98 handle_page_fault+0x10/0x30 With fix, we don't enable 1MB as hugepage size. bash-4.2# cd /sys/kernel/mm/hugepages/ bash-4.2# ls hugepages-16384kB hugepages-16777216kB Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/radix: rename _PAGE_LARGE to R_PAGE_LARGEAneesh Kumar K.V
This bit is only used by radix and it is nice to follow the naming style of having bit name start with H_/R_ depending on which translation mode they are used. No functional change in this patch. Reviewed-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/slice: Fix off-by-1 error when computing slice maskAneesh Kumar K.V
For low slice, max addr should be less than 4G. Without limiting this correctly we will end up with a low slice mask which has 17th bit set. This is not a problem with the current code because our low slice mask is of type u16. But in later patch I am switching low slice mask to u64 type and having the 17bit set result in wrong slice mask which in turn results in mmap failures. Reviewed-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/nohash: MM_SLICE is only used by book3s 64Aneesh Kumar K.V
BOOKE code is dead code as per the Kconfig details. So make it simpler by enabling MM_SLICE only for book3s_64. The changes w.r.t nohash is just removing deadcode. W.r.t ppc64, 4k without hugetlb will now enable MM_SLICE. But that is good, because we reduce one extra variant which probably is not getting tested much. Reviewed-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>