summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2019-09-25Revert "locking/pvqspinlock: Don't wait if vCPU is preempted"Wanpeng Li
This patch reverts commit 75437bb304b20 (locking/pvqspinlock: Don't wait if vCPU is preempted). A large performance regression was caused by this commit. on over-subscription scenarios. The test was run on a Xeon Skylake box, 2 sockets, 40 cores, 80 threads, with three VMs of 80 vCPUs each. The score of ebizzy -M is reduced from 13000-14000 records/s to 1700-1800 records/s: Host Guest score vanilla w/o kvm optimizations upstream 1700-1800 records/s vanilla w/o kvm optimizations revert 13000-14000 records/s vanilla w/ kvm optimizations upstream 4500-5000 records/s vanilla w/ kvm optimizations revert 14000-15500 records/s Exit from aggressive wait-early mechanism can result in premature yield and extra scheduling latency. Actually, only 6% of wait_early events are caused by vcpu_is_preempted() being true. However, when one vCPU voluntarily releases its vCPU, all the subsequently waiters in the queue will do the same and the cascading effect leads to bad performance. kvm optimizations: [1] commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts) [2] commit 266e85a5ec9 (KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption) Tested-by: loobinliu@tencent.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: loobinliu@tencent.com Cc: stable@vger.kernel.org Fixes: 75437bb304b20 (locking/pvqspinlock: Don't wait if vCPU is preempted) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24kvm: nvmx: limit atomic switch MSRsMarc Orr
Allowing an unlimited number of MSRs to be specified via the VMX load/store MSR lists (e.g., vm-entry MSR load list) is bad for two reasons. First, a guest can specify an unreasonable number of MSRs, forcing KVM to process all of them in software. Second, the SDM bounds the number of MSRs allowed to be packed into the atomic switch MSR lists. Quoting the "Miscellaneous Data" section in the "VMX Capability Reporting Facility" appendix: "Bits 27:25 is used to compute the recommended maximum number of MSRs that should appear in the VM-exit MSR-store list, the VM-exit MSR-load list, or the VM-entry MSR-load list. Specifically, if the value bits 27:25 of IA32_VMX_MISC is N, then 512 * (N + 1) is the recommended maximum number of MSRs to be included in each list. If the limit is exceeded, undefined processor behavior may result (including a machine check during the VMX transition)." Because KVM needs to protect itself and can't model "undefined processor behavior", arbitrarily force a VM-entry to fail due to MSR loading when the MSR load list is too large. Similarly, trigger an abort during a VM exit that encounters an MSR load list or MSR store list that is too large. The MSR list size is intentionally not pre-checked so as to maintain compatibility with hardware inasmuch as possible. Test these new checks with the kvm-unit-test "x86: nvmx: test max atomic switch MSRs". Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Marc Orr <marcorr@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24kvm: svm: Intercept RDPRUJim Mattson
The RDPRU instruction gives the guest read access to the IA32_APERF MSR and the IA32_MPERF MSR. According to volume 3 of the APM, "When virtualization is enabled, this instruction can be intercepted by the Hypervisor. The intercept bit is at VMCB byte offset 10h, bit 14." Since we don't enumerate the instruction in KVM_SUPPORTED_CPUID, intercept it and synthesize #UD. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Drew Schmitt <dasch@google.com> Reviewed-by: Jacob Xu <jacobhxu@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24kvm: x86: Add "significant index" flag to a few CPUID leavesJim Mattson
According to the Intel SDM, volume 2, "CPUID," the index is significant (or partially significant) for CPUID leaves 0FH, 10H, 12H, 17H, 18H, and 1FH. Add the corresponding flag to these CPUID leaves in do_host_cpuid(). Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support") Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zeroSean Christopherson
Do not skip invalid shadow pages when zapping obsolete pages if the pages' root_count has reached zero, in which case the page can be immediately zapped and freed. Update the comment accordingly. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Explicitly track only a single invalid mmu generationSean Christopherson
Toggle mmu_valid_gen between '0' and '1' instead of blindly incrementing the generation. Because slots_lock is held for the entire duration of zapping obsolete pages, it's impossible for there to be multiple invalid generations associated with shadow pages at any given time. Toggling between the two generations (valid vs. invalid) allows changing mmu_valid_gen from an unsigned long to a u8, which reduces the size of struct kvm_mmu_page from 160 to 152 bytes on 64-bit KVM, i.e. reduces KVM's memory footprint by 8 bytes per shadow page. Set sp->mmu_valid_gen before it is added to active_mmu_pages. Functionally this has no effect as kvm_mmu_alloc_page() has a single caller that sets sp->mmu_valid_gen soon thereafter, but visually it is jarring to see a shadow page being added to the list without its mmu_valid_gen first being set. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"Sean Christopherson
Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog (commit 5ff0568374ed2 was itself a partial revert): Don't force reloading the remote mmu when zapping an obsolete page, as a MMU_RELOAD request has already been issued by kvm_mmu_zap_all_fast() immediately after incrementing mmu_valid_gen, i.e. after marking pages obsolete. This reverts commit 5ff0568374ed2e585376a3832857ade5daccd381. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "Revert "KVM: MMU: reclaim the zapped-obsolete page first""Sean Christopherson
Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrashing the original changelog: Introduce a per-VM list to track obsolete shadow pages, i.e. pages which have been deleted from the mmu cache but haven't yet been freed. When page reclaiming is needed, zap/free the deleted pages first. This reverts commit 52d5dedc79bdcbac2976159a172069618cf31be5. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all ↵Sean Christopherson
pages"" Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrashing the original changelog: Reload the mmu on all vCPUs after updating the generation number so that obsolete pages are not used by any vCPUs. This allows collapsing all TLB flushes during obsolete page zapping into a single flush, as there is no need to flush when dropping mmu_lock (to reschedule). Note: a remote TLB flush is still needed before freeing the pages as other vCPUs may be doing a lockless shadow page walk. Opportunstically improve the comments restored by the revert (the code itself is a true revert). This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""Sean Christopherson
Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrashing the original changelog: Zap at least 10 shadow pages before releasing mmu_lock to reduce the overhead associated with re-acquiring the lock. Note: "10" is an arbitrary number, speculated to be high enough so that a vCPU isn't stuck zapping obsolete pages for an extended period, but small enough so that other vCPUs aren't starved waiting for mmu_lock. This reverts commit 43d2b14b105fb00b8864c7b0ee7043cc1cc4a969. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for ↵Sean Christopherson
kvm_mmu_invalidate_all_pages"" Now that the fast invalidate mechanism has been reintroduced, restore the tracepoint associated with said mechanism. Note, the name of the tracepoint deviates from the original tracepoint so as to match KVM's current nomenclature. This reverts commit 42560fb1f3c6c7f730897b7fa7a478bc37e0be50. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page ↵Sean Christopherson
related tracepoints"" Now that the fast invalidate mechanism has been reintroduced, restore tracing of the generation number in shadow page tracepoints. This reverts commit b59c4830ca185ba0e9f9e046fb1cd10a4a92627a. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Use fast invalidate mechanism to zap MMIO sptesSean Christopherson
Use the fast invalidate mechasim to zap MMIO sptes on a MMIO generation wrap. The fast invalidate flow was reintroduced to fix a livelock bug in kvm_mmu_zap_all() that can occur if kvm_mmu_zap_all() is invoked when the guest has live vCPUs. I.e. using kvm_mmu_zap_all() to handle the MMIO generation wrap is theoretically susceptible to the livelock bug. This effectively reverts commit 4771450c345dc ("Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes""), i.e. restores the behavior of commit a8eca9dcc656a ("KVM: MMU: drop kvm_mmu_zap_mmio_sptes"). Note, this actually fixes commit 571c5af06e303 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes"), but there is no need to incrementally revert back to using fast invalidate, e.g. doing so doesn't provide any bisection or stability benefits. Fixes: 571c5af06e303 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Treat invalid shadow pages as obsoleteSean Christopherson
Treat invalid shadow pages as obsolete to fix a bug where an obsolete and invalid page with a non-zero root count could become non-obsolete due to mmu_valid_gen wrapping. The bug is largely theoretical with the current code base, as an unsigned long will effectively never wrap on 64-bit KVM, and userspace would have to deliberately stall a vCPU in order to keep an obsolete invalid page on the active list while simultaneously modifying memslots billions of times to trigger a wrap. The obvious alternative is to use a 64-bit value for mmu_valid_gen, but it's actually desirable to go in the opposite direction, i.e. using a smaller 8-bit value to reduce KVM's memory footprint by 8 bytes per shadow page, and relying on proper treatment of invalid pages instead of preventing the generation from wrapping. Note, "Fixes" points at a commit that was at one point reverted, but has since been restored. Fixes: 5304b8d37c2a5 ("KVM: MMU: fast invalidate all pages") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: LAPIC: Tune lapic_timer_advance_ns smoothlyWanpeng Li
Filter out drastic fluctuation and random fluctuation, remove timer_advance_adjust_done altogether, the adjustment would be continuous. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexitTao Xu
As the latest Intel 64 and IA-32 Architectures Software Developer's Manual, UMWAIT and TPAUSE instructions cause a VM exit if the RDTSC exiting and enable user wait and pause VM-execution controls are both 1. Because KVM never enable RDTSC exiting, the vm-exit for UMWAIT and TPAUSE should never happen. Considering EXIT_REASON_XSAVES and EXIT_REASON_XRSTORS is also unexpected VM-exit for KVM. Introduce a common exit helper handle_unexpected_vmexit() to handle these unexpected VM-exit. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROLTao Xu
UMWAIT and TPAUSE instructions use 32bit IA32_UMWAIT_CONTROL at MSR index E1H to determines the maximum time in TSC-quanta that the processor can reside in either C0.1 or C0.2. This patch emulates MSR IA32_UMWAIT_CONTROL in guest and differentiate IA32_UMWAIT_CONTROL between host and guest. The variable mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value, so this patch uses it to avoid frequently rdmsr of IA32_UMWAIT_CONTROL. Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Add support for user wait instructionsTao Xu
UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions. This patch adds support for user wait instructions in KVM. Availability of the user wait instructions is indicated by the presence of the CPUID feature flag WAITPKG CPUID.0x07.0x0:ECX[5]. User wait instructions may be executed at any privilege level, and use 32bit IA32_UMWAIT_CONTROL MSR to set the maximum time. The behavior of user wait instructions in VMX non-root operation is determined first by the setting of the "enable user wait and pause" secondary processor-based VM-execution control bit 26. If the VM-execution control is 0, UMONITOR/UMWAIT/TPAUSE cause an invalid-opcode exception (#UD). If the VM-execution control is 1, treatment is based on the setting of the “RDTSC exiting†VM-execution control. Because KVM never enables RDTSC exiting, if the instruction causes a delay, the amount of time delayed is called here the physical delay. The physical delay is first computed by determining the virtual delay. If IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in EDX:EAX minus the value that RDTSC would return; if IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay is the minimum of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH). Because umwait and tpause can put a (psysical) CPU into a power saving state, by default we dont't expose it to kvm and enable it only when guest CPUID has it. Detailed information about user wait instructions can be found in the latest Intel 64 and IA-32 Architectures Software Developer's Manual. Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Add comments to document various emulation typesSean Christopherson
Document the intended usage of each emulation type as each exists to handle an edge case of one kind or another and can be easily misinterpreted at first glance. Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfigSean Christopherson
VMX's EPT misconfig flow to handle fast-MMIO path falls back to decoding the instruction to determine the instruction length when running as a guest (Hyper-V doesn't fill VMCS.VM_EXIT_INSTRUCTION_LEN because it's technically not defined for EPT misconfigs). Rather than implement the slow skip in VMX's generic skip_emulated_instruction(), handle_ept_misconfig() directly calls kvm_emulate_instruction() with EMULTYPE_SKIP, which intentionally doesn't do single-step detection, and so handle_ept_misconfig() misses a single-step #DB. Rework the EPT misconfig fallback case to route it through kvm_skip_emulated_instruction() so that single-step #DBs and interrupt shadow updates are handled automatically. I.e. make VMX's slow skip logic match SVM's and have the SVM flow not intentionally avoid the shadow update. Alternatively, the handle_ept_misconfig() could manually handle single- step detection, but that results in EMULTYPE_SKIP having split logic for the interrupt shadow vs. single-step #DBs, and split emulator logic is largely what led to this mess in the first place. Modifying SVM to mirror VMX flow isn't really an option as SVM's case isn't limited to a specific exit reason, i.e. handling the slow skip in skip_emulated_instruction() is mandatory for all intents and purposes. Drop VMX's skip_emulated_instruction() wrapper since it can now fail, and instead WARN if it fails unexpectedly, e.g. if exit_reason somehow becomes corrupted. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: d391f12070672 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Remove emulation_result enums, EMULATE_{DONE,FAIL,USER_EXIT}Sean Christopherson
Deferring emulation failure handling (in some cases) to the caller of x86_emulate_instruction() has proven fragile, e.g. multiple instances of KVM not setting run->exit_reason on EMULATE_FAIL, largely due to it being difficult to discern what emulation types can return what result, and which combination of types and results are handled where. Now that x86_emulate_instruction() always handles emulation failure, i.e. EMULATION_FAIL is only referenced in callers, remove the emulation_result enums entirely. Per KVM's existing exit handling conventions, return '0' and '1' for "exit to userspace" and "resume guest" respectively. Doing so cleans up many callers, e.g. they can return kvm_emulate_instruction() directly instead of having to interpret its result. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: VMX: Remove EMULATE_FAIL handling in handle_invalid_guest_state()Sean Christopherson
Now that EMULATE_FAIL is completely unused, remove the last remaning usage where KVM does something functional in response to EMULATE_FAIL. Leave the check in place as a WARN_ON_ONCE to provide a better paper trail when EMULATE_{DONE,FAIL,USER_EXIT} are completely removed. Opportunistically remove the gotos in handle_invalid_guest_state(). With the EMULATE_FAIL handling gone there is no need to have a common handler for emulation failure and the gotos only complicate things, e.g. the signal_pending() check always returns '1', but this is far from obvious when glancing through the code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Move triple fault request into RM int injectionSean Christopherson
Request triple fault in kvm_inject_realmode_interrupt() instead of returning EMULATE_FAIL and deferring to the caller. All existing callers request triple fault and it's highly unlikely Real Mode is going to acquire new features. While this consolidates a small amount of code, the real goal is to remove the last reference to EMULATE_FAIL. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Handle emulation failure directly in kvm_task_switch()Sean Christopherson
Consolidate the reporting of emulation failure into kvm_task_switch() so that it can return EMULATE_USER_EXIT. This helps pave the way for removing EMULATE_FAIL altogether. This also fixes a theoretical bug where task switch interception could suppress an EMULATE_USER_EXIT return. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Exit to userspace on emulation skip failureSean Christopherson
Kill a few birds with one stone by forcing an exit to userspace on skip emulation failure. This removes a reference to EMULATE_FAIL, fixes a bug in handle_ept_misconfig() where it would exit to userspace without setting run->exit_reason, and fixes a theoretical bug in SVM's task_switch_interception() where it would overwrite run->exit_reason on a return of EMULATE_USER_EXIT. Note, this technically doesn't fully fix task_switch_interception() as it now incorrectly handles EMULATE_FAIL, but in practice there is no bug as EMULATE_FAIL will never be returned for EMULTYPE_SKIP. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Move #UD injection for failed emulation into emulation codeSean Christopherson
Immediately inject a #UD and return EMULATE done if emulation fails when handling an intercepted #UD. This helps pave the way for removing EMULATE_FAIL altogether. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Add explicit flag for forced emulation on #UDSean Christopherson
Add an explicit emulation type for forced #UD emulation and use it to detect that KVM should unconditionally inject a #UD instead of falling into its standard emulation failure handling. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Move #GP injection for VMware into x86_emulate_instruction()Sean Christopherson
Immediately inject a #GP when VMware emulation fails and return EMULATE_DONE instead of propagating EMULATE_FAIL up the stack. This helps pave the way for removing EMULATE_FAIL altogether. Rename EMULTYPE_VMWARE to EMULTYPE_VMWARE_GP to document that the x86 emulator is called to handle VMware #GP interception, e.g. why a #GP is injected on emulation failure for EMULTYPE_VMWARE_GP. Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type. The "no #UD on fail" is used only in the VMWare case and is obsoleted by having the emulator itself reinject #GP. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Don't attempt VMWare emulation on #GP with non-zero error codeSean Christopherson
The VMware backdoor hooks #GP faults on IN{S}, OUT{S}, and RDPMC, none of which generate a non-zero error code for their #GP. Re-injecting #GP instead of attempting emulation on a non-zero error code will allow a future patch to move #GP injection (for emulation failure) into kvm_emulate_instruction() without having to plumb in the error code. Reviewed-and-tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Refactor kvm_vcpu_do_singlestep() to remove out paramSean Christopherson
Return the single-step emulation result directly instead of via an out param. Presumably at some point in the past kvm_vcpu_do_singlestep() could be called with *r==EMULATE_USER_EXIT, but that is no longer the case, i.e. all callers are happy to overwrite their own return variable. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Clean up handle_emulation_failure()Sean Christopherson
When handling emulation failure, return the emulation result directly instead of capturing it in a local variable. Future patches will move additional cases into handle_emulation_failure(), clean up the cruft before so there isn't an ugly mix of setting a local variable and returning directly. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Relocate MMIO exit stats countingSean Christopherson
Move the stat.mmio_exits update into x86_emulate_instruction(). This is both a bug fix, e.g. the current update flows will incorrectly increment mmio_exits on emulation failure, and a preparatory change to set the stage for eliminating EMULATE_DONE and company. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: nVMX: Check Host Address Space Size on vmentry of nested guestsKrish Sadhukhan
According to section "Checks Related to Address-Space Size" in Intel SDM vol 3C, the following checks are performed on vmentry of nested guests: If the logical processor is outside IA-32e mode (if IA32_EFER.LMA = 0) at the time of VM entry, the following must hold: - The "IA-32e mode guest" VM-entry control is 0. - The "host address-space size" VM-exit control is 0. If the logical processor is in IA-32e mode (if IA32_EFER.LMA = 1) at the time of VM entry, the "host address-space size" VM-exit control must be 1. If the "host address-space size" VM-exit control is 0, the following must hold: - The "IA-32e mode guest" VM-entry control is 0. - Bit 17 of the CR4 field (corresponding to CR4.PCIDE) is 0. - Bits 63:32 in the RIP field are 0. If the "host address-space size" VM-exit control is 1, the following must hold: - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1. - The RIP field contains a canonical address. On processors that do not support Intel 64 architecture, checks are performed to ensure that the "IA-32e mode guest" VM-entry control and the "host address-space size" VM-exit control are both 0. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: selftests: hyperv_cpuid: add check for NoNonArchitecturalCoreSharing bitVitaly Kuznetsov
The bit is supposed to be '1' when SMT is not supported or forcefully disabled and '0' otherwise. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: hyper-v: set NoNonArchitecturalCoreSharing CPUID bit when SMT is ↵Vitaly Kuznetsov
impossible Hyper-V 2019 doesn't expose MD_CLEAR CPUID bit to guests when it cannot guarantee that two virtual processors won't end up running on sibling SMT threads without knowing about it. This is done as an optimization as in this case there is nothing the guest can do to protect itself against MDS and issuing additional flush requests is just pointless. On bare metal the topology is known, however, when Hyper-V is running nested (e.g. on top of KVM) it needs an additional piece of information: a confirmation that the exposed topology (wrt vCPU placement on different SMT threads) is trustworthy. NoNonArchitecturalCoreSharing (CPUID 0x40000004 EAX bit 18) is described in TLFS as follows: "Indicates that a virtual processor will never share a physical core with another virtual processor, except for virtual processors that are reported as sibling SMT threads." From KVM we can give such guarantee in two cases: - SMT is unsupported or forcefully disabled (just 'disabled' doesn't work as it can become re-enabled during the lifetime of the guest). - vCPUs are properly pinned so the scheduler won't put them on sibling SMT threads (when they're not reported as such). This patch reports NoNonArchitecturalCoreSharing bit in to userspace in the first case. The second case is outside of KVM's domain of responsibility (as vCPU pinning is actually done by someone who manages KVM's userspace - e.g. libvirt pinning QEMU threads). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24cpu/SMT: create and export cpu_smt_possible()Vitaly Kuznetsov
KVM needs to know if SMT is theoretically possible, this means it is supported and not forcefully disabled ('nosmt=force'). Create and export cpu_smt_possible() answering this question. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: hyperv: Fix Direct Synthetic timers assert an interrupt w/o lapic_in_kernelWanpeng Li
Reported by syzkaller: kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:__apic_accept_irq+0x46/0x740 arch/x86/kvm/lapic.c:1029 Call Trace: kvm_apic_set_irq+0xb4/0x140 arch/x86/kvm/lapic.c:558 stimer_notify_direct arch/x86/kvm/hyperv.c:648 [inline] stimer_expiration arch/x86/kvm/hyperv.c:659 [inline] kvm_hv_process_stimers+0x594/0x1650 arch/x86/kvm/hyperv.c:686 vcpu_enter_guest+0x2b2a/0x54b0 arch/x86/kvm/x86.c:7896 vcpu_run+0x393/0xd40 arch/x86/kvm/x86.c:8152 kvm_arch_vcpu_ioctl_run+0x636/0x900 arch/x86/kvm/x86.c:8360 kvm_vcpu_ioctl+0x6cf/0xaf0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2765 The testcase programs HV_X64_MSR_STIMERn_CONFIG/HV_X64_MSR_STIMERn_COUNT, in addition, there is no lapic in the kernel, the counters value are small enough in order that kvm_hv_process_stimers() inject this already-expired timer interrupt into the guest through lapic in the kernel which triggers the NULL deferencing. This patch fixes it by don't advertise direct mode synthetic timers and discarding the inject when lapic is not in kernel. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=1752fe0a600000 Reported-by: syzbot+dff25ee91f0c7d5c1695@syzkaller.appspotmail.com Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: Manually flush collapsible SPTEs only when toggling flagsSean Christopherson
Zapping collapsible sptes, a.k.a. 4k sptes that can be promoted into a large page, is only necessary when changing only the dirty logging flag of a memory region. If the memslot is also being moved, then all sptes for the memslot are zapped when it is invalidated. When a memslot is being created, it is impossible for there to be existing dirty mappings, e.g. KVM can have MMIO sptes, but not present, and thus dirty, sptes. Note, the comment and logic are shamelessly borrowed from MIPS's version of kvm_arch_commit_memory_region(). Fixes: 3ea3b7fa9af06 ("kvm: mmu: lazy collapse small sptes into large sptes") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: selftests: Remove duplicate guest mode handlingPeter Xu
Remove the duplication code in run_test() of dirty_log_test because after some reordering of functions now we can directly use the outcome of vm_create(). Meanwhile, with the new VM_MODE_PXXV48_4K, we can safely revert b442324b58 too where we stick the x86_64 PA width to 39 bits for dirty_log_test. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: selftests: Introduce VM_MODE_PXXV48_4KPeter Xu
The naming VM_MODE_P52V48_4K is explicit but unclear when used on x86_64 machines, because x86_64 machines are having various physical address width rather than some static values. Here's some examples: - Intel Xeon E3-1220: 36 bits - Intel Core i7-8650: 39 bits - AMD EPYC 7251: 48 bits All of them are using 48 bits linear address width but with totally different physical address width (and most of the old machines should be less than 52 bits). Let's create a new guest mode called VM_MODE_PXXV48_4K for current x86_64 tests and make it as the default to replace the old naming of VM_MODE_P52V48_4K because it shows more clearly that the PA width is not really a constant. Meanwhile we also stop assuming all the x86 machines are having 52 bits PA width but instead we fetch the real vm->pa_bits from CPUID 0x80000008 during runtime. We currently make this exclusively used by x86_64 but no other arch. As a slight touch up, moving DEBUG macro from dirty_log_test.c to kvm_util.h so lib can use it too. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: selftests: Create VM earlier for dirty log testPeter Xu
Since we've just removed the dependency of vm type in previous patch, now we can create the vm much earlier. Note that to move it earlier we used an approximation of number of extra pages but it should be fine. This prepares for the follow up patches to finally remove the duplication of guest mode parsings. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: selftests: Move vm type into _vm_create() internallyPeter Xu
Rather than passing the vm type from the top level to the end of vm creation, let's simply keep that as an internal of kvm_vm struct and decide the type in _vm_create(). Several reasons for doing this: - The vm type is only decided by physical address width and currently only used in aarch64, so we've got enough information as long as we're passing vm_guest_mode into _vm_create(), - This removes a loop dependency between the vm->type and creation of vms. That's why now we need to parse vm_guest_mode twice sometimes, once in run_test() and then again in _vm_create(). The follow up patches will move on to clean up that as well so we can have a single place to decide guest machine types and so. Note that this patch will slightly change the behavior of aarch64 tests in that previously most vm_create() callers will directly pass in type==0 into _vm_create() but now the type will depend on vm_guest_mode, however it shouldn't affect any user because all vm_create() users of aarch64 will be using VM_MODE_DEFAULT guest mode (which is VM_MODE_P40V48_4K) so at last type will still be zero. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: announce KVM_CAP_HYPERV_ENLIGHTENED_VMCS support only when it is ↵Vitaly Kuznetsov
available It was discovered that after commit 65efa61dc0d5 ("selftests: kvm: provide common function to enable eVMCS") hyperv_cpuid selftest is failing on AMD. The reason is that the commit changed _vcpu_ioctl() to vcpu_ioctl() in the test and this one can't fail. Instead of fixing the test is seems to make more sense to not announce KVM_CAP_HYPERV_ENLIGHTENED_VMCS support if it is definitely missing (on svm and in case kvm_intel.nested=0). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86: svm: remove unneeded nested_enable_evmcs() hookVitaly Kuznetsov
Since commit 5158917c7b019 ("KVM: x86: nVMX: Allow nested_enable_evmcs to be NULL") the code in x86.c is prepared to see nested_enable_evmcs being NULL and in VMX case it actually is when nesting is disabled. Remove the unneeded stub from SVM code. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM/Hyper-V/VMX: Add direct tlb flush supportVitaly Kuznetsov
Hyper-V provides direct tlb flush function which helps L1 Hypervisor to handle Hyper-V tlb flush request from L2 guest. Add the function support for VMX. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM/Hyper-V: Add new KVM capability KVM_CAP_HYPERV_DIRECT_TLBFLUSHTianyu Lan
Hyper-V direct tlb flush function should be enabled for guest that only uses Hyper-V hypercall. User space hypervisor(e.g, Qemu) can disable KVM identification in CPUID and just exposes Hyper-V identification to make sure the precondition. Add new KVM capability KVM_CAP_ HYPERV_DIRECT_TLBFLUSH for user space to enable Hyper-V direct tlb function and this function is default to be disabled in KVM. Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24x86/Hyper-V: Fix definition of struct hv_vp_assist_pageTianyu Lan
The struct hv_vp_assist_page was defined incorrectly. The "vtl_control" should be u64[3], "nested_enlightenments _control" should be a u64 and there are 7 reserved bytes following "enlighten_vmentry". Fix the definition. Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24kvm: x86: Add Intel PMU MSRs to msrs_to_save[]Jim Mattson
These MSRs should be enumerated by KVM_GET_MSR_INDEX_LIST, so that userspace knows that these MSRs may be part of the vCPU state. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Eric Hankland <ehankland@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-23Merge tag 'mfd-next-5.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "New Drivers: - Add support for Merrifield Basin Cove PMIC New Device Support: - Add support for Intel Tiger Lake to Intel LPSS PCI - Add support for Intel Sky Lake to Intel LPSS PCI - Add support for ST-Ericsson DB8520 to DB8500 PRCMU New Functionality: - Add RTC and PWRC support to MT6323 Fix-ups: - Clean-up include files; davinci_voicecodec, asic3, sm501, mt6397 - Ignore return values from debugfs_create*(); ab3100-*, ab8500-debugfs, aat2870-core - Device Tree changes; rn5t618, mt6397 - Use new I2C API; tps80031, 88pm860x-core, ab3100-core, bcm590xx, da9150-core, max14577, max77693, max77843, max8907, max8925-i2c, max8997, max8998, palmas, twl-core, - Remove obsolete code; da9063, jz4740-adc - Simplify semantics; timberdale, htc-i2cpld - Add 'fall-through' tags; omap-usb-host, db8500-prcmu - Remove superfluous prints; ab8500-debugfs, db8500-prcmu, fsl-imx25-tsadc, intel_soc_pmic_bxtwc, qcom_rpm, sm501 - Trivial rename/whitespace/typo fixes; mt6397-core, MAINTAINERS - Reorganise code structure; mt6397-* - Improve code consistency; intel-lpss - Use MODULE_SOFTDEP() helper; intel-lpss - Use DEFINE_RES_*() helpers; mt6397-core Bug Fixes: - Clean-up resources; max77620 - Prevent input events being dropped on resume; intel-lpss-pci - Prevent sleeping in IRQ context; ezx-pcap" * tag 'mfd-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (48 commits) mfd: mt6323: Add MT6323 RTC and PWRC mfd: mt6323: Replace boilerplate resource code with DEFINE_RES_* macros mfd: mt6397: Add mutex include dt-bindings: mfd: mediatek: Add MT6323 Power Controller dt-bindings: mfd: mediatek: Update RTC to include MT6323 dt-bindings: mfd: mediatek: mt6397: Change to relative paths mfd: db8500-prcmu: Support the higher DB8520 ARMSS mfd: intel-lpss: Use MODULE_SOFTDEP() instead of implicit request mfd: htc-i2cpld: Drop check because i2c_unregister_device() is NULL safe mfd: sm501: Include the GPIO driver header mfd: intel-lpss: Add Intel Skylake ACPI IDs mfd: intel-lpss: Consistently use GENMASK() mfd: Add support for Merrifield Basin Cove PMIC mfd: ezx-pcap: Replace mutex_lock with spin_lock mfd: asic3: Include the right header MAINTAINERS: altera-sysmgr: Fix typo in a filepath mfd: mt6397: Extract IRQ related code from core driver mfd: mt6397: Rename macros to something more readable mfd: Remove dev_err() usage after platform_get_irq() mfd: db8500-prcmu: Mark expected switch fall-throughs ...
2019-09-23Merge tag 'backlight-next-5.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: "Core Frameworks - Obtain scale type through sysfs New Functionality: - Provide Device Tree functionality in rave-sp-backlight - Calculate if scale type is (non-)linear in pwm_bl Fix-ups: - Simplify code in lm3630a_bl - Trivial rename/whitespace/typo fixes in lms283gf05 - Remove superfluous NULL check in tosa_lcd - Fix power state initialisation in gpio_backlight - List supported file in MAINTAINERS Bug Fixes: - Kconfig - default to not building unless requested in {LED,BACKLIGHT}_CLASS_DEVICE" * tag 'backlight-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: backlight: pwm_bl: Set scale type for brightness curves specified in the DT backlight: pwm_bl: Set scale type for CIE 1931 curves backlight: Expose brightness curve type through sysfs MAINTAINERS: Add entry for stable backlight sysfs ABI documentation backlight: gpio-backlight: Correct initial power state handling video: backlight: tosa_lcd: drop check because i2c_unregister_device() is NULL safe video: backlight: Drop default m for {LCD,BACKLIGHT_CLASS_DEVICE} backlight: lms283gf05: Fix a typo in the description passed to 'devm_gpio_request_one()' backlight: lm3630a: Switch to use fwnode_property_count_uXX() backlight: rave-sp: Leave initial state and register with correct device