path: root/arch
Age  Commit message  Author
2019-09-24  Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)

Pull more mount API conversions from Al Viro: "Assorted conversions of options parsing to new API. gfs2 is probably the most serious one here; the rest is trivial stuff. Other things in what used to be #work.mount are going to wait for the next cycle (and preferably go via git trees of the filesystems involved)"

* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gfs2: Convert gfs2 to fs_context
  vfs: Convert spufs to use the new mount API
  vfs: Convert hypfs to use the new mount API
  hypfs: Fix error number left in struct pointer member
  vfs: Convert functionfs to use the new mount API
  vfs: Convert bpf to use the new mount API
2019-09-24  ia64: Fix some warnings introduced in merge window  (Tony Luck)

Fix:
  arch/ia64/kernel/irq_ia64.c:586:1: warning: no return statement in function returning non-void [-Wreturn-type]
  arch/ia64/mm/contig.c:111:6: warning: unused variable 'rc' [-Wunused-variable]
  arch/ia64/mm/discontig.c:189:39: warning: unused variable 'rc' [-Wunused-variable]

Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24  libnvdimm/altmap: Track namespace boundaries in altmap  (Aneesh Kumar K.V)

With a PFN_MODE_PMEM namespace, the memmap area is allocated from the device area. Some architectures map the memmap area with a large page size. On architectures like ppc64, a 16MB page for the memmap mapping can map 262144 pfns. This maps a namespace size of 16G.

When populating the memmap region with 16MB pages from the device area, make sure the allocated space is not used to map resources outside this namespace. Such usage of the device area will prevent a namespace destroy.

Add the resource end pfn to the altmap and use it to check whether a memmap area allocation would map pfns outside the namespace. On ppc64, in such a case, we fall back to allocation from regular memory.

This fixes the kernel crash reported below:

[ 132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
[ 133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
[ 133.464760] Faulting instruction address: 0xc00000000007580c
[ 133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
[ 133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
.....
[ 133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
[ 133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
[ 133.464910] Call Trace:
[ 133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
[ 133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
[ 133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
[ 133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
[ 133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
[ 133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
[ 133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
[ 133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
[ 133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
[ 133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
[ 133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
[ 133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
[ 133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
[ 133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250

djbw: Aneesh notes that this crash can likely be triggered in any kernel that supports 'papr_scm', so flagging that commit for -stable consideration. Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions") Cc: <stable@vger.kernel.org> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Tested-by: Santosh Sivaraj <santosh@fossix.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Link: https://lore.kernel.org/r/20190910062826.10041-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
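A minimal sketch of the boundary check this entry describes (the struct and helper names here are illustrative assumptions, not the exact kernel patch):

#include <stdbool.h>

/* Pared-down altmap with the namespace end recorded. */
struct vmem_altmap_sketch {
        unsigned long base_pfn; /* first pfn of the device area */
        unsigned long end_pfn;  /* last pfn of the namespace; 0 = unset */
};

/*
 * Before carving a large-page memmap mapping out of the device area,
 * detect allocations that would cover pfns past the namespace end so
 * the caller can fall back to regular memory, as the ppc64 fix does.
 */
static bool altmap_crosses_boundary(const struct vmem_altmap_sketch *altmap,
                                    unsigned long start_pfn,
                                    unsigned long nr_pfns)
{
        return altmap->end_pfn && start_pfn + nr_pfns - 1 > altmap->end_pfn;
}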
2019-09-24  powerpc/book3s64: Export has_transparent_hugepage() related functions.  (Aneesh Kumar K.V)

In a later patch, we want to use hash_transparent_hugepage() in a kernel module. Export two related functions. Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190924042440.27946-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24  kvm: nvmx: limit atomic switch MSRs  (Marc Orr)

Allowing an unlimited number of MSRs to be specified via the VMX load/store MSR lists (e.g., vm-entry MSR load list) is bad for two reasons. First, a guest can specify an unreasonable number of MSRs, forcing KVM to process all of them in software. Second, the SDM bounds the number of MSRs allowed to be packed into the atomic switch MSR lists. Quoting the "Miscellaneous Data" section in the "VMX Capability Reporting Facility" appendix:

"Bits 27:25 is used to compute the recommended maximum number of MSRs that should appear in the VM-exit MSR-store list, the VM-exit MSR-load list, or the VM-entry MSR-load list. Specifically, if the value bits 27:25 of IA32_VMX_MISC is N, then 512 * (N + 1) is the recommended maximum number of MSRs to be included in each list. If the limit is exceeded, undefined processor behavior may result (including a machine check during the VMX transition)."

Because KVM needs to protect itself and can't model "undefined processor behavior", arbitrarily force a VM-entry to fail due to MSR loading when the MSR load list is too large. Similarly, trigger an abort during a VM exit that encounters an MSR load list or MSR store list that is too large. The MSR list size is intentionally not pre-checked so as to maintain compatibility with hardware inasmuch as possible. Test these new checks with the kvm-unit-test "x86: nvmx: test max atomic switch MSRs". Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Marc Orr <marcorr@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
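For reference, the SDM formula quoted above as a small, self-contained sketch (plain C, not the KVM code):

#include <stdint.h>

/* Bits 27:25 of IA32_VMX_MISC encode N; the recommended maximum
 * number of MSRs per atomic switch list is 512 * (N + 1). */
static uint32_t vmx_misc_max_msr_list_size(uint64_t ia32_vmx_misc)
{
        uint32_t n = (uint32_t)(ia32_vmx_misc >> 25) & 0x7;

        return 512 * (n + 1);
}

Exceeding this bound is exactly where the SDM leaves processor behavior undefined, which is why KVM fails the VM-entry (or aborts the VM-exit) instead of processing an oversized list.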
2019-09-24  kvm: svm: Intercept RDPRU  (Jim Mattson)
The RDPRU instruction gives the guest read access to the IA32_APERF MSR and the IA32_MPERF MSR. According to volume 3 of the APM, "When virtualization is enabled, this instruction can be intercepted by the Hypervisor. The intercept bit is at VMCB byte offset 10h, bit 14." Since we don't enumerate the instruction in KVM_SUPPORTED_CPUID, intercept it and synthesize #UD. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Drew Schmitt <dasch@google.com> Reviewed-by: Jacob Xu <jacobhxu@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
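A sketch of the described handling (the bit position comes from the APM quote above; the surrounding structure is assumed for illustration):

#include <stdint.h>

#define RDPRU_INTERCEPT_BIT 14  /* VMCB byte offset 10h, bit 14 (APM) */

/* Set the RDPRU intercept bit in the intercept dword at VMCB byte
 * offset 10h; the exit handler then queues #UD for the guest, since
 * RDPRU is not enumerated in KVM_SUPPORTED_CPUID. */
static void enable_rdpru_intercept(uint32_t *intercept_dword_at_10h)
{
        *intercept_dword_at_10h |= UINT32_C(1) << RDPRU_INTERCEPT_BIT;
}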
2019-09-24  kvm: x86: Add "significant index" flag to a few CPUID leaves  (Jim Mattson)
According to the Intel SDM, volume 2, "CPUID," the index is significant (or partially significant) for CPUID leaves 0FH, 10H, 12H, 17H, 18H, and 1FH. Add the corresponding flag to these CPUID leaves in do_host_cpuid(). Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support") Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zero  (Sean Christopherson)
Do not skip invalid shadow pages when zapping obsolete pages if the pages' root_count has reached zero, in which case the page can be immediately zapped and freed. Update the comment accordingly. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Explicitly track only a single invalid mmu generation  (Sean Christopherson)
Toggle mmu_valid_gen between '0' and '1' instead of blindly incrementing the generation. Because slots_lock is held for the entire duration of zapping obsolete pages, it's impossible for there to be multiple invalid generations associated with shadow pages at any given time. Toggling between the two generations (valid vs. invalid) allows changing mmu_valid_gen from an unsigned long to a u8, which reduces the size of struct kvm_mmu_page from 160 to 152 bytes on 64-bit KVM, i.e. reduces KVM's memory footprint by 8 bytes per shadow page. Set sp->mmu_valid_gen before it is added to active_mmu_pages. Functionally this has no effect as kvm_mmu_alloc_page() has a single caller that sets sp->mmu_valid_gen soon thereafter, but visually it is jarring to see a shadow page being added to the list without its mmu_valid_gen first being set. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
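A sketch of the toggle (stand-in types; the real fields live in struct kvm_arch and struct kvm_mmu_page):

#include <stdint.h>

struct kvm_arch_sketch {
        uint8_t mmu_valid_gen;  /* shrunk from unsigned long to u8 */
};

/* Called with slots_lock held, so at most one invalid generation can
 * exist at a time and flipping between 0 and 1 is sufficient; every
 * shadow page still carrying the old value is now obsolete. */
static void invalidate_all_shadow_pages(struct kvm_arch_sketch *arch)
{
        arch->mmu_valid_gen = arch->mmu_valid_gen ? 0 : 1;
}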
2019-09-24  KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"  (Sean Christopherson)

Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog (commit 5ff0568374ed2 was itself a partial revert): Don't force reloading the remote mmu when zapping an obsolete page, as an MMU_RELOAD request has already been issued by kvm_mmu_zap_all_fast() immediately after incrementing mmu_valid_gen, i.e. after marking pages obsolete. This reverts commit 5ff0568374ed2e585376a3832857ade5daccd381. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Revert "Revert "KVM: MMU: reclaim the zapped-obsolete page first""  (Sean Christopherson)

Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog: Introduce a per-VM list to track obsolete shadow pages, i.e. pages which have been deleted from the mmu cache but haven't yet been freed. When page reclaiming is needed, zap/free the deleted pages first. This reverts commit 52d5dedc79bdcbac2976159a172069618cf31be5. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all pages""  (Sean Christopherson)

Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog: Reload the mmu on all vCPUs after updating the generation number so that obsolete pages are not used by any vCPUs. This allows collapsing all TLB flushes during obsolete page zapping into a single flush, as there is no need to flush when dropping mmu_lock (to reschedule). Note: a remote TLB flush is still needed before freeing the pages as other vCPUs may be doing a lockless shadow page walk. Opportunistically improve the comments restored by the revert (the code itself is a true revert). This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""  (Sean Christopherson)

Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog: Zap at least 10 shadow pages before releasing mmu_lock to reduce the overhead associated with re-acquiring the lock. Note: "10" is an arbitrary number, speculated to be high enough so that a vCPU isn't stuck zapping obsolete pages for an extended period, but small enough so that other vCPUs aren't starved waiting for mmu_lock. This reverts commit 43d2b14b105fb00b8864c7b0ee7043cc1cc4a969. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages""  (Sean Christopherson)

Now that the fast invalidate mechanism has been reintroduced, restore the tracepoint associated with said mechanism. Note, the name of the tracepoint deviates from the original tracepoint so as to match KVM's current nomenclature. This reverts commit 42560fb1f3c6c7f730897b7fa7a478bc37e0be50. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints""  (Sean Christopherson)

Now that the fast invalidate mechanism has been reintroduced, restore tracing of the generation number in shadow page tracepoints. This reverts commit b59c4830ca185ba0e9f9e046fb1cd10a4a92627a. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Use fast invalidate mechanism to zap MMIO sptes  (Sean Christopherson)

Use the fast invalidate mechanism to zap MMIO sptes on an MMIO generation wrap. The fast invalidate flow was reintroduced to fix a livelock bug in kvm_mmu_zap_all() that can occur if kvm_mmu_zap_all() is invoked when the guest has live vCPUs. I.e. using kvm_mmu_zap_all() to handle the MMIO generation wrap is theoretically susceptible to the livelock bug. This effectively reverts commit 4771450c345dc ("Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes""), i.e. restores the behavior of commit a8eca9dcc656a ("KVM: MMU: drop kvm_mmu_zap_mmio_sptes"). Note, this actually fixes commit 571c5af06e303 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes"), but there is no need to incrementally revert back to using fast invalidate, e.g. doing so doesn't provide any bisection or stability benefits. Fixes: 571c5af06e303 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86/mmu: Treat invalid shadow pages as obsolete  (Sean Christopherson)
Treat invalid shadow pages as obsolete to fix a bug where an obsolete and invalid page with a non-zero root count could become non-obsolete due to mmu_valid_gen wrapping. The bug is largely theoretical with the current code base, as an unsigned long will effectively never wrap on 64-bit KVM, and userspace would have to deliberately stall a vCPU in order to keep an obsolete invalid page on the active list while simultaneously modifying memslots billions of times to trigger a wrap. The obvious alternative is to use a 64-bit value for mmu_valid_gen, but it's actually desirable to go in the opposite direction, i.e. using a smaller 8-bit value to reduce KVM's memory footprint by 8 bytes per shadow page, and relying on proper treatment of invalid pages instead of preventing the generation from wrapping. Note, "Fixes" points at a commit that was at one point reverted, but has since been restored. Fixes: 5304b8d37c2a5 ("KVM: MMU: fast invalidate all pages") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
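A sketch of the obsolescence test after this change (field names are stand-ins):

#include <stdbool.h>
#include <stdint.h>

struct shadow_page_sketch {
        bool role_invalid;
        uint8_t mmu_valid_gen;
};

/* An invalid page is always reported obsolete, so a wrapping
 * mmu_valid_gen can never make a stale invalid page look fresh. */
static bool is_obsolete_sp(uint8_t kvm_mmu_valid_gen,
                           const struct shadow_page_sketch *sp)
{
        return sp->role_invalid ||
               sp->mmu_valid_gen != kvm_mmu_valid_gen;
}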
2019-09-24  KVM: LAPIC: Tune lapic_timer_advance_ns smoothly  (Wanpeng Li)

Filter out both drastic and random fluctuations, and remove timer_advance_adjust_done altogether so that the adjustment is continuous. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit  (Tao Xu)

Per the latest Intel 64 and IA-32 Architectures Software Developer's Manual, the UMWAIT and TPAUSE instructions cause a VM exit if the "RDTSC exiting" and "enable user wait and pause" VM-execution controls are both 1. Because KVM never enables RDTSC exiting, VM exits for UMWAIT and TPAUSE should never happen. EXIT_REASON_XSAVES and EXIT_REASON_XRSTORS are likewise unexpected VM exits for KVM. Introduce a common exit helper, handle_unexpected_vmexit(), to handle these unexpected VM exits. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL  (Tao Xu)

The UMWAIT and TPAUSE instructions use the 32-bit IA32_UMWAIT_CONTROL MSR at index E1H to determine the maximum time in TSC-quanta that the processor can reside in either C0.1 or C0.2. This patch emulates MSR IA32_UMWAIT_CONTROL in the guest and differentiates IA32_UMWAIT_CONTROL between host and guest. The variable mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value, so this patch uses it to avoid frequent rdmsr of IA32_UMWAIT_CONTROL. Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
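A sketch of the write-side validation implied above (bit layout per the SDM: bit 0 disables C0.2, bit 1 is reserved, bits 31:2 hold the TSC-quanta limit, and the MSR is 32 bits wide; this is not the exact KVM code):

#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_UMWAIT_CONTROL 0xe1

/* Reject writes that set the reserved bit 1 or any of bits 63:32. */
static bool umwait_control_write_valid(uint64_t data)
{
        return (data >> 32) == 0 && (data & 0x2) == 0;
}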
2019-09-24  KVM: x86: Add support for user wait instructions  (Tao Xu)

UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions. This patch adds support for user wait instructions in KVM. Availability of the user wait instructions is indicated by the presence of the CPUID feature flag WAITPKG, CPUID.0x07.0x0:ECX[5]. User wait instructions may be executed at any privilege level, and use the 32-bit IA32_UMWAIT_CONTROL MSR to set the maximum wait time.

The behavior of user wait instructions in VMX non-root operation is determined first by the setting of the "enable user wait and pause" secondary processor-based VM-execution control bit 26. If the VM-execution control is 0, UMONITOR/UMWAIT/TPAUSE cause an invalid-opcode exception (#UD). If the VM-execution control is 1, treatment is based on the setting of the "RDTSC exiting" VM-execution control. Because KVM never enables RDTSC exiting, if the instruction causes a delay, the amount of time delayed is called here the physical delay. The physical delay is first computed by determining the virtual delay. If IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in EDX:EAX minus the value that RDTSC would return; if IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay is the minimum of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH).

Because UMWAIT and TPAUSE can put a (physical) CPU into a power saving state, by default we don't expose them to the guest and enable them only when guest CPUID has the feature. Detailed information about user wait instructions can be found in the latest Intel 64 and IA-32 Architectures Software Developer's Manual. Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
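The virtual-delay computation quoted above, as pure arithmetic (a sketch, not KVM code):

#include <stdint.h>

/* EDX:EAX gives the target TSC deadline; IA32_UMWAIT_CONTROL[31:2],
 * i.e. AND(IA32_UMWAIT_CONTROL, FFFFFFFCH), optionally caps the wait. */
static uint64_t umwait_virtual_delay(uint32_t edx, uint32_t eax,
                                     uint64_t tsc_now,
                                     uint32_t umwait_control)
{
        uint64_t deadline = ((uint64_t)edx << 32) | eax;
        uint64_t delay = deadline > tsc_now ? deadline - tsc_now : 0;
        uint64_t cap = umwait_control & 0xFFFFFFFCu;

        return cap && cap < delay ? cap : delay; /* zero cap = no limit */
}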
2019-09-24  KVM: x86: Add comments to document various emulation types  (Sean Christopherson)
Document the intended usage of each emulation type as each exists to handle an edge case of one kind or another and can be easily misinterpreted at first glance. Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig  (Sean Christopherson)

VMX's EPT misconfig flow to handle the fast-MMIO path falls back to decoding the instruction to determine the instruction length when running as a guest (Hyper-V doesn't fill VMCS.VM_EXIT_INSTRUCTION_LEN because it's technically not defined for EPT misconfigs). Rather than implement the slow skip in VMX's generic skip_emulated_instruction(), handle_ept_misconfig() directly calls kvm_emulate_instruction() with EMULTYPE_SKIP, which intentionally doesn't do single-step detection, and so handle_ept_misconfig() misses a single-step #DB.

Rework the EPT misconfig fallback case to route it through kvm_skip_emulated_instruction() so that single-step #DBs and interrupt shadow updates are handled automatically. I.e. make VMX's slow skip logic match SVM's and have the SVM flow not intentionally avoid the shadow update.

Alternatively, handle_ept_misconfig() could manually handle single-step detection, but that results in EMULTYPE_SKIP having split logic for the interrupt shadow vs. single-step #DBs, and split emulator logic is largely what led to this mess in the first place. Modifying SVM to mirror the VMX flow isn't really an option as SVM's case isn't limited to a specific exit reason, i.e. handling the slow skip in skip_emulated_instruction() is mandatory for all intents and purposes.

Drop VMX's skip_emulated_instruction() wrapper since it can now fail, and instead WARN if it fails unexpectedly, e.g. if exit_reason somehow becomes corrupted. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: d391f12070672 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Remove emulation_result enums, EMULATE_{DONE,FAIL,USER_EXIT}  (Sean Christopherson)

Deferring emulation failure handling (in some cases) to the caller of x86_emulate_instruction() has proven fragile, e.g. multiple instances of KVM not setting run->exit_reason on EMULATE_FAIL, largely due to it being difficult to discern what emulation types can return what result, and which combination of types and results are handled where. Now that x86_emulate_instruction() always handles emulation failure, i.e. EMULATE_FAIL is only referenced in callers, remove the emulation_result enums entirely. Per KVM's existing exit handling conventions, return '0' and '1' for "exit to userspace" and "resume guest" respectively. Doing so cleans up many callers, e.g. they can return kvm_emulate_instruction() directly instead of having to interpret its result. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
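A sketch of the resulting convention (stub declaration assumed): handlers return 1 to resume the guest and 0 to exit to userspace, so callers can forward the emulator's result directly.

struct kvm_vcpu;        /* opaque here */

int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);

/* Hypothetical intercept handler after the cleanup. */
static int handle_some_intercept(struct kvm_vcpu *vcpu)
{
        /* 1 == resume guest, 0 == exit to userspace; no enum to map. */
        return kvm_emulate_instruction(vcpu, 0);
}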
2019-09-24  KVM: VMX: Remove EMULATE_FAIL handling in handle_invalid_guest_state()  (Sean Christopherson)

Now that EMULATE_FAIL is completely unused, remove the last remaining usage where KVM does something functional in response to EMULATE_FAIL. Leave the check in place as a WARN_ON_ONCE to provide a better paper trail when EMULATE_{DONE,FAIL,USER_EXIT} are completely removed. Opportunistically remove the gotos in handle_invalid_guest_state(). With the EMULATE_FAIL handling gone there is no need to have a common handler for emulation failure and the gotos only complicate things, e.g. the signal_pending() check always returns '1', but this is far from obvious when glancing through the code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Move triple fault request into RM int injection  (Sean Christopherson)
Request triple fault in kvm_inject_realmode_interrupt() instead of returning EMULATE_FAIL and deferring to the caller. All existing callers request triple fault and it's highly unlikely Real Mode is going to acquire new features. While this consolidates a small amount of code, the real goal is to remove the last reference to EMULATE_FAIL. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Handle emulation failure directly in kvm_task_switch()  (Sean Christopherson)
Consolidate the reporting of emulation failure into kvm_task_switch() so that it can return EMULATE_USER_EXIT. This helps pave the way for removing EMULATE_FAIL altogether. This also fixes a theoretical bug where task switch interception could suppress an EMULATE_USER_EXIT return. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Exit to userspace on emulation skip failure  (Sean Christopherson)
Kill a few birds with one stone by forcing an exit to userspace on skip emulation failure. This removes a reference to EMULATE_FAIL, fixes a bug in handle_ept_misconfig() where it would exit to userspace without setting run->exit_reason, and fixes a theoretical bug in SVM's task_switch_interception() where it would overwrite run->exit_reason on a return of EMULATE_USER_EXIT. Note, this technically doesn't fully fix task_switch_interception() as it now incorrectly handles EMULATE_FAIL, but in practice there is no bug as EMULATE_FAIL will never be returned for EMULTYPE_SKIP. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Move #UD injection for failed emulation into emulation code  (Sean Christopherson)

Immediately inject a #UD and return EMULATE_DONE if emulation fails when handling an intercepted #UD. This helps pave the way for removing EMULATE_FAIL altogether. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Add explicit flag for forced emulation on #UD  (Sean Christopherson)
Add an explicit emulation type for forced #UD emulation and use it to detect that KVM should unconditionally inject a #UD instead of falling into its standard emulation failure handling. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Move #GP injection for VMware into x86_emulate_instruction()  (Sean Christopherson)
Immediately inject a #GP when VMware emulation fails and return EMULATE_DONE instead of propagating EMULATE_FAIL up the stack. This helps pave the way for removing EMULATE_FAIL altogether. Rename EMULTYPE_VMWARE to EMULTYPE_VMWARE_GP to document that the x86 emulator is called to handle VMware #GP interception, e.g. why a #GP is injected on emulation failure for EMULTYPE_VMWARE_GP. Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type. The "no #UD on fail" is used only in the VMWare case and is obsoleted by having the emulator itself reinject #GP. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Don't attempt VMWare emulation on #GP with non-zero error code  (Sean Christopherson)
The VMware backdoor hooks #GP faults on IN{S}, OUT{S}, and RDPMC, none of which generate a non-zero error code for their #GP. Re-injecting #GP instead of attempting emulation on a non-zero error code will allow a future patch to move #GP injection (for emulation failure) into kvm_emulate_instruction() without having to plumb in the error code. Reviewed-and-tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Refactor kvm_vcpu_do_singlestep() to remove out param  (Sean Christopherson)
Return the single-step emulation result directly instead of via an out param. Presumably at some point in the past kvm_vcpu_do_singlestep() could be called with *r==EMULATE_USER_EXIT, but that is no longer the case, i.e. all callers are happy to overwrite their own return variable. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Clean up handle_emulation_failure()  (Sean Christopherson)

When handling emulation failure, return the emulation result directly instead of capturing it in a local variable. Future patches will move additional cases into handle_emulation_failure(); clean up the cruft beforehand so there isn't an ugly mix of setting a local variable and returning directly. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Relocate MMIO exit stats counting  (Sean Christopherson)
Move the stat.mmio_exits update into x86_emulate_instruction(). This is both a bug fix, e.g. the current update flows will incorrectly increment mmio_exits on emulation failure, and a preparatory change to set the stage for eliminating EMULATE_DONE and company. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: nVMX: Check Host Address Space Size on vmentry of nested guests  (Krish Sadhukhan)

According to section "Checks Related to Address-Space Size" in Intel SDM vol 3C, the following checks are performed on vmentry of nested guests:

If the logical processor is outside IA-32e mode (if IA32_EFER.LMA = 0) at the time of VM entry, the following must hold:
  - The "IA-32e mode guest" VM-entry control is 0.
  - The "host address-space size" VM-exit control is 0.

If the logical processor is in IA-32e mode (if IA32_EFER.LMA = 1) at the time of VM entry, the "host address-space size" VM-exit control must be 1.

If the "host address-space size" VM-exit control is 0, the following must hold:
  - The "IA-32e mode guest" VM-entry control is 0.
  - Bit 17 of the CR4 field (corresponding to CR4.PCIDE) is 0.
  - Bits 63:32 in the RIP field are 0.

If the "host address-space size" VM-exit control is 1, the following must hold:
  - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
  - The RIP field contains a canonical address.

On processors that do not support Intel 64 architecture, checks are performed to ensure that the "IA-32e mode guest" VM-entry control and the "host address-space size" VM-exit control are both 0. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
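A sketch translating the checks above into code (flat parameters instead of real VMCS accessors; the canonical-address check is elided):

#include <stdbool.h>
#include <stdint.h>

static bool nested_host_addr_space_ok(bool efer_lma,
                                      bool ia32e_mode_guest_ctl,
                                      bool host_addr_space_size_ctl,
                                      uint64_t host_cr4,
                                      uint64_t host_rip)
{
        if (!efer_lma && (ia32e_mode_guest_ctl || host_addr_space_size_ctl))
                return false;
        if (efer_lma && !host_addr_space_size_ctl)
                return false;

        if (!host_addr_space_size_ctl) {
                if (host_cr4 & (UINT64_C(1) << 17))     /* CR4.PCIDE */
                        return false;
                if (host_rip >> 32)                     /* RIP[63:32] */
                        return false;
        } else {
                if (!(host_cr4 & (UINT64_C(1) << 5)))   /* CR4.PAE */
                        return false;
                /* host_rip must also be canonical (check omitted). */
        }
        return true;
}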
2019-09-24  KVM: x86: hyper-v: set NoNonArchitecturalCoreSharing CPUID bit when SMT is impossible  (Vitaly Kuznetsov)

Hyper-V 2019 doesn't expose the MD_CLEAR CPUID bit to guests when it cannot guarantee that two virtual processors won't end up running on sibling SMT threads without knowing about it. This is done as an optimization: in this case there is nothing the guest can do to protect itself against MDS, and issuing additional flush requests is just pointless. On bare metal the topology is known; however, when Hyper-V is running nested (e.g. on top of KVM) it needs an additional piece of information: a confirmation that the exposed topology (wrt vCPU placement on different SMT threads) is trustworthy.

NoNonArchitecturalCoreSharing (CPUID 0x40000004 EAX bit 18) is described in the TLFS as follows: "Indicates that a virtual processor will never share a physical core with another virtual processor, except for virtual processors that are reported as sibling SMT threads." From KVM we can give such a guarantee in two cases:
  - SMT is unsupported or forcefully disabled (just 'disabled' doesn't work as it can become re-enabled during the lifetime of the guest).
  - vCPUs are properly pinned so the scheduler won't put them on sibling SMT threads (when they're not reported as such).

This patch reports the NoNonArchitecturalCoreSharing bit to userspace in the first case. The second case is outside of KVM's domain of responsibility (as vCPU pinning is actually done by someone who manages KVM's userspace, e.g. libvirt pinning QEMU threads). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
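A sketch of the gating logic (cpu_smt_possible() here stands for the changelog's "SMT is unsupported or forcefully disabled" test; treat the exact helper and bit position as assumptions):

#include <stdbool.h>
#include <stdint.h>

#define HV_X64_NO_NONARCH_CORESHARING (1u << 18) /* CPUID 0x40000004 EAX */

bool cpu_smt_possible(void); /* false iff SMT can never become active */

static void hv_set_coresharing_bit(uint32_t *cpuid_40000004_eax)
{
        /* Only safe when SMT can never be re-enabled during the guest's
         * lifetime; honoring vCPU pinning is left to userspace. */
        if (!cpu_smt_possible())
                *cpuid_40000004_eax |= HV_X64_NO_NONARCH_CORESHARING;
}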
2019-09-24  KVM: hyperv: Fix Direct Synthetic timers assert an interrupt w/o lapic_in_kernel  (Wanpeng Li)

Reported by syzkaller:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:__apic_accept_irq+0x46/0x740 arch/x86/kvm/lapic.c:1029
Call Trace:
 kvm_apic_set_irq+0xb4/0x140 arch/x86/kvm/lapic.c:558
 stimer_notify_direct arch/x86/kvm/hyperv.c:648 [inline]
 stimer_expiration arch/x86/kvm/hyperv.c:659 [inline]
 kvm_hv_process_stimers+0x594/0x1650 arch/x86/kvm/hyperv.c:686
 vcpu_enter_guest+0x2b2a/0x54b0 arch/x86/kvm/x86.c:7896
 vcpu_run+0x393/0xd40 arch/x86/kvm/x86.c:8152
 kvm_arch_vcpu_ioctl_run+0x636/0x900 arch/x86/kvm/x86.c:8360
 kvm_vcpu_ioctl+0x6cf/0xaf0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2765

The testcase programs HV_X64_MSR_STIMERn_CONFIG/HV_X64_MSR_STIMERn_COUNT while there is no in-kernel lapic, and the counter values are small enough that kvm_hv_process_stimers() injects the already-expired timer interrupt into the guest through the in-kernel lapic path, which triggers the NULL dereference. This patch fixes it by not advertising direct mode synthetic timers and by discarding the injection when the lapic is not in-kernel. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=1752fe0a600000 Reported-by: syzbot+dff25ee91f0c7d5c1695@syzkaller.appspotmail.com Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: Manually flush collapsible SPTEs only when toggling flags  (Sean Christopherson)
Zapping collapsible sptes, a.k.a. 4k sptes that can be promoted into a large page, is only necessary when changing only the dirty logging flag of a memory region. If the memslot is also being moved, then all sptes for the memslot are zapped when it is invalidated. When a memslot is being created, it is impossible for there to be existing dirty mappings, e.g. KVM can have MMIO sptes, but not present, and thus dirty, sptes. Note, the comment and logic are shamelessly borrowed from MIPS's version of kvm_arch_commit_memory_region(). Fixes: 3ea3b7fa9af06 ("kvm: mmu: lazy collapse small sptes into large sptes") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: announce KVM_CAP_HYPERV_ENLIGHTENED_VMCS support only when it is available  (Vitaly Kuznetsov)

It was discovered that after commit 65efa61dc0d5 ("selftests: kvm: provide common function to enable eVMCS") the hyperv_cpuid selftest is failing on AMD. The reason is that the commit changed _vcpu_ioctl() to vcpu_ioctl() in the test, and the latter can't fail. Instead of fixing the test, it seems to make more sense to not announce KVM_CAP_HYPERV_ENLIGHTENED_VMCS support if it is definitely missing (on svm and in case kvm_intel.nested=0). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM: x86: svm: remove unneeded nested_enable_evmcs() hook  (Vitaly Kuznetsov)

Since commit 5158917c7b019 ("KVM: x86: nVMX: Allow nested_enable_evmcs to be NULL") the code in x86.c is prepared to see nested_enable_evmcs being NULL, and in the VMX case it actually is when nesting is disabled. Remove the unneeded stub from SVM code. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM/Hyper-V/VMX: Add direct tlb flush support  (Vitaly Kuznetsov)

Hyper-V provides a direct tlb flush function which helps the L1 hypervisor handle Hyper-V tlb flush requests from L2 guests. Add support for this function in VMX. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  KVM/Hyper-V: Add new KVM capability KVM_CAP_HYPERV_DIRECT_TLBFLUSH  (Tianyu Lan)

The Hyper-V direct tlb flush function should be enabled only for guests that use Hyper-V hypercalls exclusively. A userspace hypervisor (e.g. QEMU) can disable KVM identification in CPUID and expose only the Hyper-V identification to ensure this precondition. Add a new KVM capability, KVM_CAP_HYPERV_DIRECT_TLBFLUSH, for userspace to enable the Hyper-V direct tlb flush function; the function is disabled by default in KVM. Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
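From the userspace side, opting in might look like this sketch (per-vCPU KVM_ENABLE_CAP on a kernel exposing the capability; error handling trimmed):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Enable Hyper-V direct tlb flush for one vCPU; only sensible when the
 * VM exposes Hyper-V identification and hides KVM identification. */
static int enable_hv_direct_tlbflush(int vcpu_fd)
{
        struct kvm_enable_cap cap;

        memset(&cap, 0, sizeof(cap));
        cap.cap = KVM_CAP_HYPERV_DIRECT_TLBFLUSH;
        return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}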
2019-09-24  x86/Hyper-V: Fix definition of struct hv_vp_assist_page  (Tianyu Lan)

The struct hv_vp_assist_page was defined incorrectly. The "vtl_control" field should be u64[3], "nested_enlightenments_control" should be a u64, and there are 7 reserved bytes following "enlighten_vmentry". Fix the definition. Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
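The corrected layout per the description above, in sketch form (fields not named in the changelog, such as apic_assist and current_nested_vmcs, are assumed from context; the authoritative definition lives in the Hyper-V TLFS header):

#include <stdint.h>

struct hv_vp_assist_page_sketch {
        uint32_t apic_assist;
        uint32_t reserved1;
        uint64_t vtl_control[3];                /* fixed: now u64[3] */
        uint64_t nested_enlightenments_control; /* fixed: now a u64 */
        uint8_t  enlighten_vmentry;
        uint8_t  reserved2[7];                  /* fixed: 7 reserved bytes */
        uint64_t current_nested_vmcs;
};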
2019-09-24  kvm: x86: Add Intel PMU MSRs to msrs_to_save[]  (Jim Mattson)
These MSRs should be enumerated by KVM_GET_MSR_INDEX_LIST, so that userspace knows that these MSRs may be part of the vCPU state. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Eric Hankland <ehankland@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24  powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9  (Aneesh Kumar K.V)

On POWER9, under some circumstances, a broadcast TLB invalidation will fail to invalidate the ERAT cache on some threads when there are parallel mtpidr/mtlpidr instructions happening on other threads of the same core. This can cause stores to continue to go to a page after it's unmapped. The workaround is to force an ERAT flush using a PID=0 or LPID=0 tlbie flush. This additional TLB flush will cause the ERAT cache invalidation. Since we are using PID=0 or LPID=0, we don't get filtered out by the TLB snoop filtering logic. We need to still follow this up with another tlbie to take care of the store vs tlbie ordering issue explained in commit a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9"). The presence of the ERAT cache implies we can still get new stores and they may miss the store queue marking flush. Cc: stable@vger.kernel.org Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
2019-09-24  powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag  (Aneesh Kumar K.V)

Rename the #define to indicate this is related to the store vs tlbie ordering issue. In the next patch, we will be adding another feature flag that is used to handle the ERAT flush vs tlbie ordering issue. Fixes: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
2019-09-24  powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions  (Aneesh Kumar K.V)

The store ordering vs tlbie issue mentioned in commit a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") is fixed in the Nimbus 2.3 and Cumulus 1.3 revisions. We don't need to apply the fixup if we are running on them.

We can only do this on PowerNV. On a pseries guest with KVM we still don't support redoing the feature fixup after migration, so we should be enabling all the workarounds needed, because we can possibly migrate between DD 2.3 and DD 2.2. Fixes: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-1-aneesh.kumar@linux.ibm.com
2019-09-24  powerpc/pseries: Call H_BLOCK_REMOVE when supported  (Laurent Dufour)

Depending on the hardware and the hypervisor, the hcall H_BLOCK_REMOVE may not be able to process all the page sizes for a segment base page size, as reported by the TLB Invalidate Characteristics. For each pair of base segment page size and actual page size, this characteristic tells us the size of the block the hcall supports.

When the hcall does not support a pair of base segment page size and actual page size, it returns H_PARAM, which leads to a panic like this:

kernel BUG at /home/srikar/work/linux.git/arch/powerpc/platforms/pseries/lpar.c:466!
Oops: Exception in kernel mode, sig: 5 [#1]
BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 28 PID: 583 Comm: modprobe Not tainted 5.2.0-master #5
NIP: c0000000000be8dc LR: c0000000000be880 CTR: 0000000000000000
REGS: c0000007e77fb130 TRAP: 0700 Not tainted (5.2.0-master)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 42224824 XER: 20000000
CFAR: c0000000000be8fc IRQMASK: 0
GPR00: 0000000022224828 c0000007e77fb3c0 c000000001434d00 0000000000000005
GPR04: 9000000004fa8c00 0000000000000000 0000000000000003 0000000000000001
GPR08: c0000007e77fb450 0000000000000000 0000000000000001 ffffffffffffffff
GPR12: c0000007e77fb450 c00000000edfcb80 0000cd7d3ea30000 c0000000016022b0
GPR16: 00000000000000b0 0000cd7d3ea30000 0000000000000001 c080001f04f00105
GPR20: 0000000000000003 0000000000000004 c000000fbeb05f58 c000000001602200
GPR24: 0000000000000000 0000000000000004 8800000000000000 c000000000c5d148
GPR28: c000000000000000 8000000000000000 a000000000000000 c0000007e77fb580
NIP [c0000000000be8dc] .call_block_remove+0x12c/0x220
LR [c0000000000be880] .call_block_remove+0xd0/0x220
Call Trace:
0xc000000fb8c00240 (unreliable)
.pSeries_lpar_flush_hash_range+0x578/0x670
.flush_hash_range+0x44/0x100
.__flush_tlb_pending+0x3c/0xc0
.zap_pte_range+0x7ec/0x830
.unmap_page_range+0x3f4/0x540
.unmap_vmas+0x94/0x120
.exit_mmap+0xac/0x1f0
.mmput+0x9c/0x1f0
.do_exit+0x388/0xd60
.do_group_exit+0x54/0x100
.__se_sys_exit_group+0x14/0x20
system_call+0x5c/0x70
Instruction dump:
39400001 38a00000 4800003c 60000000 60420000 7fa9e800 38e00000 419e0014
7d29d278 7d290074 7929d182 69270001 <0b070000> 7d495378 394a0001 7fa93040

The call to H_BLOCK_REMOVE should only be made for the supported pairs of base segment page size and actual page size, and using the correct maximum block size. Due to the required complexity in do_block_remove() and call_block_remove(), and the fact that currently a block size of 8 is returned by the hypervisor, we only support a block size of 8 with the H_BLOCK_REMOVE hcall.

In order to identify this limitation easily in the code, a local define HBLKR_SUPPORTED_SIZE defining the currently supported block size, and a dedicated checking helper is_supported_hlbkr(), are introduced. For regular pages and hugetlb, the assumption is made that the page size is equal to the base page size. For THP the page size is assumed to be 16M. Fixes: ba2dd8a26baa ("powerpc/pseries/mm: call H_BLOCK_REMOVE") Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190920130523.20441-3-ldufour@linux.ibm.com
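A sketch of the gating helper described above (the define and helper names come from the changelog; the table shape is an assumption, and hblkr_size is the table introduced in the following entry):

#include <stdbool.h>

#define HBLKR_SUPPORTED_SIZE 8  /* the one block size handled today */
#define MMU_PAGE_COUNT 16       /* assumed table dimensions */

/* Filled at boot from the TLB Invalidate Characteristics. */
extern int hblkr_size[MMU_PAGE_COUNT][MMU_PAGE_COUNT];

/* Use H_BLOCK_REMOVE only for (base segment page size, actual page
 * size) pairs whose reported block size is the supported one;
 * otherwise fall back to the regular flush path and avoid the
 * H_PARAM panic. */
static bool is_supported_hlbkr(int bpsize, int psize)
{
        return hblkr_size[bpsize][psize] == HBLKR_SUPPORTED_SIZE;
}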
2019-09-24  powerpc/pseries: Read TLB Block Invalidate Characteristics  (Laurent Dufour)

The PAPR document specifies the TLB Block Invalidate Characteristics, which tell, for each pair of segment base page size and actual page size, the size of the block the hcall H_BLOCK_REMOVE supports. These characteristics are loaded at boot time into a new table, hblkr_size. The table is separate from the mmu_psize_def table because this is specific to the pseries platform. A new init function, pseries_lpar_read_hblkrm_characteristics(), is added to read the characteristics. It is called from pSeries_setup_arch(). Fixes: ba2dd8a26baa ("powerpc/pseries/mm: call H_BLOCK_REMOVE") Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190920130523.20441-2-ldufour@linux.ibm.com