summaryrefslogtreecommitdiffstats
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)Author
2020-06-11KVM: x86: do not pass poisoned hva to __kvm_set_memory_regionPaolo Bonzini
__kvm_set_memory_region does not use the hva at all, so trying to catch use-after-delete is pointless and, worse, it fails access_ok now that we apply it to all memslots including private kernel ones. This fixes an AVIC regression. Fixes: 09d952c971a5 ("KVM: check userspace_addr for all memslots") Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11KVM: async_pf: Inject 'page ready' event only if 'page not present' was ↵Vitaly Kuznetsov
previously injected 'Page not present' event may or may not get injected depending on guest's state. If the event wasn't injected, there is no need to inject the corresponding 'page ready' event as the guest may get confused. E.g. Linux thinks that the corresponding 'page not present' event wasn't delivered *yet* and allocates a 'dummy entry' for it. This entry is never freed. Note, 'wakeup all' events have no corresponding 'page not present' event and always get injected. s390 seems to always be able to inject 'page not present', the change is effectively a nop. Suggested-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200610175532.779793-2-vkuznets@redhat.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=208081 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11KVM: x86: respect singlestep when emulating instructionFelipe Franciosi
When userspace configures KVM_GUESTDBG_SINGLESTEP, KVM will manage the presence of X86_EFLAGS_TF via kvm_set/get_rflags on vcpus. The actual rflag bit is therefore hidden from callers. That includes init_emulate_ctxt() which uses the value returned from kvm_get_flags() to set ctxt->tf. As a result, x86_emulate_instruction() will skip a single step, leaving singlestep_rip stale and not returning to userspace. This resolves the issue by observing the vcpu guest_debug configuration alongside ctxt->tf in x86_emulate_instruction(), performing the single step if set. Cc: stable@vger.kernel.org Signed-off-by: Felipe Franciosi <felipe@nutanix.com> Message-Id: <20200519081048.8204-1-felipe@nutanix.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-09KVM: x86: Unexport x86_fpu_cache and make it staticSean Christopherson
Make x86_fpu_cache static now that FPU allocation and destruction is handled entirely by common x86 code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200608180218.20946-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-08KVM: x86: Fix APIC page invalidation raceEiichi Tsukata
Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried to fix inappropriate APIC page invalidation by re-introducing arch specific kvm_arch_mmu_notifier_invalidate_range() and calling it from kvm_mmu_notifier_invalidate_range_start. However, the patch left a possible race where the VMCS APIC address cache is updated *before* it is unmapped: (Invalidator) kvm_mmu_notifier_invalidate_range_start() (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD) (KVM VCPU) vcpu_enter_guest() (KVM VCPU) kvm_vcpu_reload_apic_access_page() (Invalidator) actually unmap page Because of the above race, there can be a mismatch between the host physical address stored in the APIC_ACCESS_PAGE VMCS field and the host physical address stored in the EPT entry for the APIC GPA (0xfee0000). When this happens, the processor will not trap APIC accesses, and will instead show the raw contents of the APIC-access page. Because Windows OS periodically checks for unexpected modifications to the LAPIC register, this will show up as a BSOD crash with BugCheck CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in https://bugzilla.redhat.com/show_bug.cgi?id=1751017. The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range() cannot guarantee that no additional references are taken to the pages in the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately, this case is supported by the MMU notifier API, as documented in include/linux/mmu_notifier.h: * If the subsystem * can't guarantee that no additional references are taken to * the pages in the range, it has to implement the * invalidate_range() notifier to remove any references taken * after invalidate_range_start(). The fix therefore is to reload the APIC-access page field in the VMCS from kvm_mmu_notifier_invalidate_range() instead of ..._range_start(). Cc: stable@vger.kernel.org Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation") Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951 Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-08Revert "KVM: x86: work around leak of uninitialized stack contents"Vitaly Kuznetsov
handle_vmptrst()/handle_vmread() stopped injecting #PF unconditionally and switched to nested_vmx_handle_memory_failure() which just kills the guest with KVM_EXIT_INTERNAL_ERROR in case of MMIO access, zeroing 'exception' in kvm_write_guest_virt_system() is not needed anymore. This reverts commit 541ab2aeb28251bf7135c7961f3a6080eebcc705. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200605115906.532682-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-04KVM: x86: minor code refactor and comments fixup around dirty loggingAnthony Yznaga
Consolidate the code and correct the comments to show that the actions taken to update existing mappings to disable or enable dirty logging are not necessary when creating, moving, or deleting a memslot. Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Message-Id: <1591128450-11977-4-git-send-email-anthony.yznaga@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-04KVM: x86: avoid unnecessary rmap walks when creating/moving slotsAnthony Yznaga
On large memory guests it has been observed that creating a memslot for a very large range can take noticeable amount of time. Investigation showed that the time is spent walking the rmaps to update existing sptes to remove write access or set/clear dirty bits to support dirty logging. These rmap walks are unnecessary when creating or moving a memslot. A newly created memslot will not have any existing mappings, and the existing mappings of a moved memslot will have been invalidated and flushed. Any mappings established once the new/moved memslot becomes visible will be set using the properties of the new slot. Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Message-Id: <1591128450-11977-3-git-send-email-anthony.yznaga@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-04KVM: x86: remove unnecessary rmap walk of read-only memslotsAnthony Yznaga
There's no write access to remove. An existing memslot cannot be updated to set or clear KVM_MEM_READONLY, and any mappings established in a newly created or moved read-only memslot will already be read-only. Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Message-Id: <1591128450-11977-2-git-send-email-anthony.yznaga@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01x86/kvm/hyper-v: Add support for synthetic debugger interfaceJon Doron
Add support for Hyper-V synthetic debugger (syndbg) interface. The syndbg interface is using MSRs to emulate a way to send/recv packets data. The debug transport dll (kdvm/kdnet) will identify if Hyper-V is enabled and if it supports the synthetic debugger interface it will attempt to use it, instead of trying to initialize a network adapter. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Jon Doron <arilou@gmail.com> Message-Id: <20200529134543.1127440-4-arilou@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86/pmu: Support full width countingLike Xu
Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0) for GP counters that allows writing the full counter width. Enable this range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]). The guest would query CPUID to get the counter width, and sign extends the counter values as needed. The traditional MSRs always limit to 32bit, even though the counter internally is larger (48 or 57 bits). When the new capability is set, use the alternative range which do not have these restrictions. This lowers the overhead of perf stat slightly because it has to do less interrupts to accumulate the counter value. Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20200529074347.124619-3-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' inWei Wang
Change kvm_pmu_get_msr() to get the msr_data struct, as the host_initiated field from the struct could be used by get_msr. This also makes this API consistent with kvm_pmu_set_msr. No functional changes. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Message-Id: <20200529074347.124619-2-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: announce KVM_FEATURE_ASYNC_PF_INTVitaly Kuznetsov
Introduce new capability to indicate that KVM supports interrupt based delivery of 'page ready' APF events. This includes support for both MSR_KVM_ASYNC_PF_INT and MSR_KVM_ASYNC_PF_ACK. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: acknowledgment mechanism for async pf page ready notificationsVitaly Kuznetsov
If two page ready notifications happen back to back the second one is not delivered and the only mechanism we currently have is kvm_check_async_pf_completion() check in vcpu_run() loop. The check will only be performed with the next vmexit when it happens and in some cases it may take a while. With interrupt based page ready notification delivery the situation is even worse: unlike exceptions, interrupts are not handled immediately so we must check if the slot is empty. This is slow and unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate the fact that the slot is free and host should check its notification queue. Mandate using it for interrupt based 'page ready' APF event delivery. As kvm_check_async_pf_completion() is going away from vcpu_run() we need a way to communicate the fact that vcpu->async_pf.done queue has transitioned from empty to non-empty state. Introduce kvm_arch_async_page_present_queued() and KVM_REQ_APF_READY to do the job. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-7-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: interrupt based APF 'page ready' event deliveryVitaly Kuznetsov
Concerns were expressed around APF delivery via synthetic #PF exception as in some cases such delivery may collide with real page fault. For 'page ready' notifications we can easily switch to using an interrupt instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the legacy one. One notable difference between the two mechanisms is that interrupt may not get handled immediately so whenever we would like to deliver next event (regardless of its type) we must be sure the guest had read and cleared previous event in the slot. While on it, get rid on 'type 1/type 2' names for APF events in the documentation as they are causing confusion. Use 'page not present' and 'page ready' everywhere instead. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: rename kvm_arch_can_inject_async_page_present() to ↵Vitaly Kuznetsov
kvm_arch_can_dequeue_async_page_present() An innocent reader of the following x86 KVM code: bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu) { if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED)) return true; ... may get very confused: if APF mechanism is not enabled, why do we report that we 'can inject async page present'? In reality, upon injection kvm_arch_async_page_present() will check the same condition again and, in case APF is disabled, will just drop the item. This is fine as the guest which deliberately disabled APF doesn't expect to get any APF notifications. Rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present() to make it clear what we are checking: if the item can be dequeued (meaning either injected or just dropped). On s390 kvm_arch_can_inject_async_page_present() always returns 'true' so the rename doesn't matter much. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: extend struct kvm_vcpu_pv_apf_data with token infoVitaly Kuznetsov
Currently, APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extent the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While on it, rename 'reason' to 'flags'. This doesn't change the semantics as we only have reasons '1' and '2' and these can be treated as bit flags but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery making 'reason' name misleading. The newly introduced apf_put_user_ready() temporary puts both flags and token information, this will be changed to put token only when we switch to interrupt based notifications. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page ↵Vitaly Kuznetsov
Ready" exceptions simultaneously" Commit 9a6e7c39810e (""KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously") added a protection against 'page ready' notification coming before 'page not present' is delivered. This situation seems to be impossible since commit 2a266f23550b ("KVM MMU: check pending exception before injecting APF) which added 'vcpu->arch.exception.pending' check to kvm_can_do_async_pf. On x86, kvm_arch_async_page_present() has only one call site: kvm_check_async_pf_completion() loop and we only enter the loop when kvm_arch_can_inject_async_page_present(vcpu) which when async pf msr is enabled, translates into kvm_can_do_async_pf(). There is also one problem with the cancellation mechanism. We don't seem to check that the 'page not present' notification we're canceling matches the 'page ready' notification so in theory, we may erroneously drop two valid events. Revert the commit. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATEPaolo Bonzini
Similar to VMX, the state that is captured through the currently available IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. In particular, the SVM-specific state for nested virtualization includes the L1 saved state (including the interrupt flag), the cached L2 controls, and the GIF. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: nSVM: inject exceptions via svm_check_nested_eventsPaolo Bonzini
This allows exceptions injected by the emulator to be properly delivered as vmexits. The code also becomes simpler, because we can just let all L0-intercepted exceptions go through the usual path. In particular, our emulation of the VMX #DB exit qualification is very much simplified, because the vmexit injection path can use kvm_deliver_exception_payload to update DR6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: x86: enable event window in inject_pending_eventPaolo Bonzini
In case an interrupt arrives after nested.check_events but before the call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt window even if the interrupt is actually going to be a vmexit. This is useless rather than harmful, but it really complicates reasoning about SVM's handling of the VINTR intercept. We'd like to never bother with the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in that case there is no interrupt window and we can just exit the nested guest whenever we want. This patch moves the opening of the interrupt window inside inject_pending_event. This consolidates the check for pending interrupt/NMI/SMI in one place, and makes KVM's usage of immediate exits more consistent, extending it beyond just nested virtualization. There are two functional changes here. They only affect corner cases, but overall they simplify the inject_pending_event. - re-injection of still-pending events will also use req_immediate_exit instead of using interrupt-window intercepts. This should have no impact on performance on Intel since it simply replaces an interrupt-window or NMI-window exit for a preemption-timer exit. On AMD, which has no equivalent of the preemption time, it may incur some overhead but an actual effect on performance should only be visible in pathological cases. - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true if an interrupt, NMI or SMI is blocked by nested_run_pending. This makes sense because entering the VM will allow it to make progress and deliver the event. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: x86: track manually whether an event has been injectedPaolo Bonzini
Instead of calling kvm_event_needs_reinjection, track its future return value in a variable. This will be useful in the next patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: x86: Initialize tdp_level during vCPU creationSean Christopherson
Initialize vcpu->arch.tdp_level during vCPU creation to avoid consuming garbage if userspace calls KVM_RUN without first calling KVM_SET_CPUID. Fixes: e93fd3b3e89e9 ("KVM: x86/mmu: Capture TDP level when updating CPUID") Reported-by: syzbot+904752567107eefb728c@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200527085400.23759-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27kvm/x86: Remove redundant function implementations彭浩(Richard)
pic_in_kernel(), ioapic_in_kernel() and irqchip_kernel() have the same implementation. Signed-off-by: Peng Hao <richard.peng@oppo.com> Message-Id: <HKAPR02MB4291D5926EA10B8BFE9EA0D3E0B70@HKAPR02MB4291.apcprd02.prod.outlook.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: Fix the indentation to match coding styleHaiwei Li
There is a bad indentation in next&queue branch. The patch looks like fixes nothing though it fixes the indentation. Before fixing: if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) { kvm_skip_emulated_instruction(vcpu); ret = EXIT_FASTPATH_EXIT_HANDLED; } break; case MSR_IA32_TSCDEADLINE: After fixing: if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) { kvm_skip_emulated_instruction(vcpu); ret = EXIT_FASTPATH_EXIT_HANDLED; } break; case MSR_IA32_TSCDEADLINE: Signed-off-by: Haiwei Li <lihaiwei@tencent.com> Message-Id: <2f78457e-f3a7-3bc9-e237-3132ee87f71e@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: x86: Remove superfluous brackets from case statementSean Christopherson
Remove unnecessary brackets from a case statement that unintentionally encapsulates unrelated case statements in the same switch statement. While technically legal and functionally correct syntax, the brackets are visually confusing and potentially dangerous, e.g. the last of the encapsulated case statements has an undocumented fall-through that isn't flagged by compilers due the encapsulation. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200218234012.7110-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: x86: allow KVM_STATE_NESTED_MTF_PENDING in kvm_state flagsPaolo Bonzini
The migration functionality was left incomplete in commit 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation", 2020-02-23), fix it. Fixes: 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation") Cc: stable@vger.kernel.org Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27Merge branch 'kvm-master' into HEADPaolo Bonzini
Merge AMD fixes before doing more development work.
2020-05-27KVM: x86: don't expose MSR_IA32_UMWAIT_CONTROL unconditionallyMaxim Levitsky
This msr is only available when the host supports WAITPKG feature. This breaks a nested guest, if the L1 hypervisor is set to ignore unknown msrs, because the only other safety check that the kernel does is that it attempts to read the msr and rejects it if it gets an exception. Cc: stable@vger.kernel.org Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200523161455.3940-3-mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mceJim Mattson
Bank_num is a one-based count of banks, not a zero-based index. It overflows the allocated space only when strictly greater than KVM_MAX_MCE_BANKS. Fixes: a9e38c3e01ad ("KVM: x86: Catch potential overrun in MCE setup") Signed-off-by: Jue Wang <juew@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Message-Id: <20200511225616.19557-1-jmattson@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15kvm: add halt-polling cpu usage statsDavid Matlack
Two new stats for exposing halt-polling cpu usage: halt_poll_success_ns halt_poll_fail_ns Thus sum of these 2 stats is the total cpu time spent polling. "success" means the VCPU polled until a virtual interrupt was delivered. "fail" means the VCPU had to schedule out (either because the maximum poll time was reached or it needed to yield the CPU). To avoid touching every arch's kvm_vcpu_stat struct, only update and export halt-polling cpu usage stats if we're on x86. Exporting cpu usage as a u64 and in nanoseconds means we will overflow at ~500 years, which seems reasonably large. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Jon Cargille <jcargill@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200508182240.68440-1-jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: TSCDEADLINE MSR emulation fastpathWanpeng Li
This patch implements a fast path for emulation of writes to the TSCDEADLINE MSR. Besides shortcutting various housekeeping tasks in the vCPU loop, the fast path can also deliver the timer interrupt directly without going through KVM_REQ_PENDING_TIMER because it runs in vCPU context. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-7-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Introduce more exit_fastpath_completion enum valuesWanpeng Li
Adds a fastpath_t typedef since enum lines are a bit long, and replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values. - EXIT_FASTPATH_EXIT_HANDLED kvm will still go through it's full run loop, but it would skip invoking the exit handler. - EXIT_FASTPATH_REENTER_GUEST complete fastpath, guest can be re-entered without invoking the exit handler or going back to vcpu_run Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Introduce kvm_vcpu_exit_request() helperWanpeng Li
Introduce kvm_vcpu_exit_request() helper, we need to check some conditions before enter guest again immediately, we skip invoking the exit handler and go through full run loop if complete fastpath but there is stuff preventing we enter guest again immediately. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-5-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Sanity check on gfn before removalPeter Xu
The index returned by kvm_async_pf_gfn_slot() will be removed when an async pf gfn is going to be removed. However kvm_async_pf_gfn_slot() is not reliable in that it can return the last key it loops over even if the gfn is not found in the async gfn array. It should never happen, but it's still better to sanity check against that to make sure no unexpected gfn will be removed. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155910.267514-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Force ASYNC_PF_PER_VCPU to be power of twoPeter Xu
Forcing the ASYNC_PF_PER_VCPU to be power of two is much easier to be used rather than calling roundup_pow_of_two() from time to time. Do this by adding a BUILD_BUG_ON() inside the hash function. Another point is that generally async pf does not allow concurrency over ASYNC_PF_PER_VCPU after all (see kvm_setup_async_pf()), so it does not make much sense either to have it not a power of two or some of the entries will definitely be wasted. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155859.267366-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enumsSean Christopherson
Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g. if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PT_PDPE_LEVEL; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PT_DIRECTORY_LEVEL; else ept_lpage_level = PT_PAGE_TABLE_LEVEL; versus if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PG_LEVEL_1G; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PG_LEVEL_2M; else ept_lpage_level = PG_LEVEL_4K; No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15kvm: x86: Cleanup vcpu->arch.guest_xstate_sizeXiaoyao Li
vcpu->arch.guest_xstate_size lost its only user since commit df1daba7d1cb ("KVM: x86: support XSAVES usage in the host"), so clean it up. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200429154312.1411-1-xiaoyao.li@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: Save L1 TSC offset in 'struct kvm_vcpu_arch'Sean Christopherson
Save L1's TSC offset in 'struct kvm_vcpu_arch' and drop the kvm_x86_ops hook read_l1_tsc_offset(). This avoids a retpoline (when configured) when reading L1's effective TSC, which is done at least once on every VM-Exit. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: Replace late check_nested_events() hack with more precise fixPaolo Bonzini
Add an argument to interrupt_allowed and nmi_allowed, to checking if interrupt injection is blocked. Use the hook to handle the case where an interrupt arrives between check_nested_events() and the injection logic. Drop the retry of check_nested_events() that hack-a-fixed the same condition. Blocking injection is also a bit of a hack, e.g. KVM should do exiting and non-exiting interrupt processing in a single pass, but it's a more precise hack. The old comment is also misleading, e.g. KVM_REQ_EVENT is purely an optimization, setting it on every run loop (which KVM doesn't do) should not affect functionality, only performance. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-13-sean.j.christopherson@intel.com> [Extend to SVM, add SMI and NMI. Even though NMI and SMI cannot come asynchronously right now, making the fix generic is easy and removes a special case. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: WARN on injected+pending exception even in nested caseSean Christopherson
WARN if a pending exception is coincident with an injected exception before calling check_nested_events() so that the WARN will fire even if inject_pending_event() bails early because check_nested_events() detects the conflict. Bailing early isn't problematic (quite the opposite), but suppressing the WARN is undesirable as it could mask a bug elsewhere in KVM. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: replace is_smm checks with kvm_x86_ops.smi_allowedPaolo Bonzini
Do not hardcode is_smm so that all the architectural conditions for blocking SMIs are listed in a single place. Well, in two places because this introduces some code duplication between Intel and AMD. This ensures that nested SVM obeys GIF in kvm_vcpu_has_events. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit setSean Christopherson
Re-request KVM_REQ_EVENT if vcpu_enter_guest() bails after processing pending requests and an immediate exit was requested. This fixes a bug where a pending event, e.g. VMX preemption timer, is delayed and/or lost if the exit was deferred due to something other than a higher priority _injected_ event, e.g. due to a pending nested VM-Enter. This bug only affects the !injected case as kvm_x86_ops.cancel_injection() sets KVM_REQ_EVENT to redo the injection, but that's purely serendipitous behavior with respect to the deferred event. Note, emulated preemption timer isn't the only event that can be affected, it simply happens to be the only event where not re-requesting KVM_REQ_EVENT is blatantly visible to the guest. Fixes: f4124500c2c13 ("KVM: nVMX: Fully emulate preemption timer") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Open a window for pending nested VMX preemption timerSean Christopherson
Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and use it to effectively open a window for servicing the expired timer. Like pending SMIs on VMX, opening a window simply means requesting an immediate exit. This fixes a bug where an expired VMX preemption timer (for L2) will be delayed and/or lost if a pending exception is injected into L2. The pending exception is rightly prioritized by vmx_check_nested_events() and injected into L2, with the preemption timer left pending. Because no window opened, L2 is free to run uninterrupted. Fixes: f4124500c2c13 ("KVM: nVMX: Fully emulate preemption timer") Reported-by: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com> [Check it in kvm_vcpu_has_events too, to ensure that the preemption timer is serviced promptly even if the vCPU is halted and L1 is not intercepting HLT. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13Merge branch 'kvm-amd-fixes' into HEADPaolo Bonzini
2020-05-13KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.cBabu Moger
Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU resource isn't. It can be read with XSAVE and written with XRSTOR. So, if we don't set the guest PKRU value here(kvm_load_guest_xsave_state), the guest can read the host value. In case of kvm_load_host_xsave_state, guest with CR4.PKE clear could potentially use XRSTOR to change the host PKRU value. While at it, move pkru state save/restore to common code and the host_pkru field to kvm_vcpu_arch. This will let SVM support protection keys. Cc: stable@vger.kernel.org Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: SVM: Disable AVIC before setting V_IRQSuravee Suthikulpanit
The commit 64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1") introduced a WARN_ON, which checks if AVIC is enabled when trying to set V_IRQ in the VMCB for enabling irq window. The following warning is triggered because the requesting vcpu (to deactivate AVIC) does not get to process APICv update request for itself until the next #vmexit. WARNING: CPU: 0 PID: 118232 at arch/x86/kvm/svm/svm.c:1372 enable_irq_window+0x6a/0xa0 [kvm_amd] RIP: 0010:enable_irq_window+0x6a/0xa0 [kvm_amd] Call Trace: kvm_arch_vcpu_ioctl_run+0x6e3/0x1b50 [kvm] ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm] ? _copy_to_user+0x26/0x30 ? kvm_vm_ioctl+0xb3e/0xd90 [kvm] ? set_next_entity+0x78/0xc0 kvm_vcpu_ioctl+0x236/0x610 [kvm] ksys_ioctl+0x8a/0xc0 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x58/0x210 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes by sending APICV update request to all other vcpus, and immediately update APIC for itself. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lkml.org/lkml/2020/5/2/167 Fixes: 64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1") Message-Id: <1588818939-54264-1-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: Introduce kvm_make_all_cpus_request_except()Suravee Suthikulpanit
This allows making request to all other vcpus except the one specified in the parameter. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6Paolo Bonzini
There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the different handling of DR6 on intercepted #DB exceptions on Intel and AMD. On Intel, #DB exceptions transmit the DR6 value via the exit qualification field of the VMCS, and the exit qualification only contains the description of the precise event that caused a vmexit. On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception was to be injected into the guest. This has two effects when guest debugging is in use: * the guest DR6 is clobbered * the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather than just the last one that happened (the testcase in the next patch covers this issue). This patch fixes both issues by emulating, so to speak, the Intel behavior on AMD processors. The important observation is that (after the previous patches) the VMCB value of DR6 is only ever observable from the guest is KVM_DEBUGREG_WONT_EXIT is set. Therefore we can actually set vmcb->save.dr6 to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it will be if guest debugging is enabled. Therefore it is possible to enter the guest with an all-zero DR6, reconstruct the #DB payload from the DR6 we get at exit time, and let kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6. Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT is set, but this is harmless. This may not be the most optimized way to deal with this, but it is simple and, being confined within SVM code, it gets rid of the set_dr6 callback and kvm_update_dr6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6Paolo Bonzini
kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the second argument. Ensure that the VMCB value is synchronized to vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so that the current value of DR6 is always available in vcpu->arch.dr6. The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>