Intel/AMD virtualization isolation and containment
Notes
- This is the second part of a series. Read Part 1 - Process Isolation and Containment
- Unless mentioned otherwise I will be referring to Intel and Linux architecture
Virtual hardware
The key capability that enables cloud computing is the ability to separate computational activity from physical devices. This is generally referred to as virtualization.
The Popek and Goldberg virtualization requirements can be summarized at a high level as follows:
- Virtualization constructs an isomorphism from guest to host by implementing functions V() and E()
- All guest state S is mapped onto host state S' through a function V(S)
- For every state change operation E(S) in the guest there is a corresponding state change E'(S') in the host
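The isomorphism can be made concrete with a toy model. The sketch below is only an illustration: `GuestState`, `HostState`, `OFFSET` and `E_prime` are hypothetical names invented here, and the "machine" is a single counter. It checks that applying a guest operation and then mapping to the host gives the same result as mapping first and then applying the host operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuestState:
    counter: int  # the whole guest "machine" is one counter in this toy


@dataclass(frozen=True)
class HostState:
    slot: int  # where the host keeps the guest's counter


OFFSET = 1000  # assumption: the host stores guest state at a fixed offset


def V(s: GuestState) -> HostState:
    """Map guest state S onto host state S'."""
    return HostState(slot=s.counter + OFFSET)


def E(s: GuestState) -> GuestState:
    """A state change operation in the guest."""
    return GuestState(counter=s.counter + 1)


def E_prime(h: HostState) -> HostState:
    """The corresponding state change in the host."""
    return HostState(slot=h.slot + 1)


def commutes(s: GuestState) -> bool:
    """True when V(E(S)) equals E'(V(S)) for this state."""
    return V(E(s)) == E_prime(V(s))
```

If `commutes` holds for every reachable guest state, the host faithfully tracks the guest, which is the property a real VMM has to preserve for the full x86 state rather than for one counter.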
In our case we are looking for a host Intel x86 system S' to securely and efficiently have the state of a guest Intel x86 system S mapped to it. One option would be to emulate the entire system in software, meeting all the requirements. Some virtualization systems use a technique called para-virtualization, which in our case (Linux/Intel) often means running a kernel in Ring 1, trapping privileged instructions and using emulation to provide the expected control flow. Both of these approaches lack elements of security and efficiency.
Processor evolution
No privilege levels
- All code runs with full privileges
- No isolation of code or data
Process virtualization
- Instructions can be executed in different security contexts
- Unprivileged code is isolated from other unprivileged code and data
- Unprivileged code is isolated from privileged code and data
- Privileged code has no restrictions and can perform hardware I/O
CPU virtualization
- The evolved system with process virtualization is completely present
- Hardware support for CPU virtualization has been added
In order to support efficient, hardware-based virtualization, Intel and AMD launched separate but functionally similar hardware virtualization extensions in 2005 and 2006. This formed the foundation that both companies would add to in the following years. Intel's system is called VT-x and its instruction set extension is known as Virtual Machine Extensions (VMX); AMD's system is called AMD-V.
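On Linux, a quick way to see whether the processor advertises these extensions is the `flags` line in `/proc/cpuinfo`, where `vmx` indicates VT-x and `svm` indicates AMD-V. The helper below is a minimal sketch; the function name is made up for this post, and it takes the file contents as a string so it can be tried on any text.

```python
def virtualization_flags(cpuinfo_text: str) -> set:
    """Return the subset of {"vmx", "svm"} found on the "flags" lines."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match only the "flags" key; newer kernels also print a separate
        # "vmx flags" line, which lists VMX sub-features and is ignored here.
        if sep and key.strip() == "flags":
            flags = value.split()
            found.update(f for f in ("vmx", "svm") if f in flags)
    return found
```

Typical use would be `virtualization_flags(open("/proc/cpuinfo").read())`. An empty result can also mean the feature is disabled in firmware, or that the code is running inside a guest without nested virtualization exposed.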
Intel CPU Virtualization
As part of the VMX extensions, a new privilege system was introduced to determine access to the VMX instructions.
- VMX capable processors can run in either root or non-root mode.
- The root mode can access the VMX instructions and can run hardware-based virtual machines.
- The non-root mode cannot access the VMX instructions.
- These modes are orthogonal to the existing ring-based privilege level system.
Privilege level | root | non-root |
---|---|---|
Ring 0 | host kernel | guest kernel |
Ring 3 | host process | guest process |
- The VMM (Hypervisor) controls all entries to and exits from the VM
A Hypervisor, Control Program or Virtual Machine Monitor (VMM) is computer software, firmware or hardware that creates and runs virtual machines. A computer on which a hypervisor runs one or more virtual machines is called a host machine, and each virtual machine is called a guest machine.
- A multi-process kernel creates multiple processes and arranges their memory and execution so that they cannot interfere with each other.
- A VMM (Virtual Machine Monitor) creates multiple virtual machines to run software and arranges their memory and execution so that they cannot interfere with each other.
- 10 years of KVM (LWN)
- https://binarydebt.wordpress.com/2018/10/14/intel-virtualisation-how-vt-x-kvm-and-qemu-work-together/
VM exit/entry
- Instructions such as CPUID and MOV from/to CR3 are intercepted as VMEXIT
- Exceptions/faults such as page fault are intercepted as VMEXIT and virtualized exceptions/faults are injected on VMENTRY to guests
- External interrupts unrelated to guests are intercepted as VMEXIT and virtualized interrupts are injected on VMENTRY to guests
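The exit/entry cycle described above can be pictured as a loop in the VMM. The following is a toy simulation, not how KVM or any real hypervisor is written: `run_guest` and its `(kind, detail)` event tuples are hypothetical, and a guest-bound interrupt is simplified to "host handles it, then a virtual copy is queued".

```python
from collections import deque


def run_guest(events):
    """Simulate a VMM loop over a list of (kind, detail) exit causes."""
    pending = deque()  # virtual events waiting for the next VM entry
    log = []

    def vm_entry():
        # On VM entry, anything queued by the previous handler is injected.
        while pending:
            log.append(("inject", pending.popleft()))

    for kind, detail in events:
        vm_entry()
        # The guest runs until something forces a VM exit.
        if kind == "instruction":
            log.append(("emulate", detail))  # e.g. CPUID, MOV to CR3
        elif kind == "exception":
            pending.append(f"virtual {detail}")
        elif kind == "interrupt":
            log.append(("host-handles", detail))
            pending.append(f"virtual {detail} interrupt")
        else:
            raise ValueError(f"unknown exit cause: {kind}")
    vm_entry()  # final entry delivers whatever is still queued
    return log
```

Running it on `[("instruction", "CPUID"), ("exception", "page fault"), ("interrupt", "timer")]` shows each exit being handled in root mode before the matching virtual event is delivered on the following entry.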
VMEXIT reasons
Category | Description |
---|---|
Exception | Any guest instruction that causes an exception |
Interrupt | An external I/O interrupt |
Root-mode sensitive | x86 privileged or sensitive instructions (e.g. hlt, pause) |
Hypercall | vmcall - Explicit transition from non-root to root |
VT-x new | ISA extensions to control non-root execution (e.g. vmclear, vmlaunch) |
Other reasons: triple fault (failure), legacy emulation, interrupt window, legacy I/O instructions, EPT violations.
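To tie the categories above to what a VMM actually sees, here is a small lookup keyed by basic exit reason number. The numbers are the ones I recall from the Intel SDM exit reason appendix and should be checked against the manual before relying on them; the category labels and the `classify_exit` helper are my own grouping for this post, not an official taxonomy.

```python
# basic exit reason -> (name, category from the table above)
BASIC_EXIT_REASONS = {
    0: ("EXCEPTION_OR_NMI", "Exception"),
    1: ("EXTERNAL_INTERRUPT", "Interrupt"),
    2: ("TRIPLE_FAULT", "Other"),
    7: ("INTERRUPT_WINDOW", "Other"),
    10: ("CPUID", "Root-mode sensitive"),
    12: ("HLT", "Root-mode sensitive"),
    18: ("VMCALL", "Hypercall"),
    19: ("VMCLEAR", "VT-x new"),
    20: ("VMLAUNCH", "VT-x new"),
    28: ("CR_ACCESS", "Root-mode sensitive"),
    30: ("IO_INSTRUCTION", "Other"),
    40: ("PAUSE", "Root-mode sensitive"),
    48: ("EPT_VIOLATION", "Other"),
}


def classify_exit(exit_reason_field: int):
    """Return (name, category) for a raw 32-bit exit reason value."""
    # Only the low 16 bits carry the basic exit reason; the upper bits
    # hold modifier flags such as the VM-entry failure bit.
    basic = exit_reason_field & 0xFFFF
    return BASIC_EXIT_REASONS.get(basic, (f"UNKNOWN_{basic}", "Other"))
```

For example, `classify_exit(18)` yields the hypercall entry, and a value with high flag bits set is still classified by its low 16 bits.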
VMEXIT security controls
- https://www.kernel.org/doc/html/v5.2/admin-guide/hw-vuln/l1tf.html
- https://lkml.org/lkml/2019/11/1/161
Nested Virtual Machines
The Intel x86 architecture with VMX is a single-level virtualization architecture. This means that only a single VMM can use the processor's VMX extensions to run guests. Running a hypervisor inside a guest therefore requires VMX emulation by the host VMM.
The “Nested VMX” feature adds this missing capability - of running guest hypervisors (which use VMX) with their own nested guest. It does so by allowing a guest to use VMX instructions, and correctly and efficiently emulating them using the single level of VMX available in the hardware.
Since the Intel x86 architecture is a single-level virtualization architecture, only a single hypervisor can use the processor's VMX instructions to run its guests. For unmodified guest hypervisors to use VMX instructions, this single bare-metal hypervisor, which we call L0, needs to emulate VMX. This emulation of VMX can work recursively. Given that L0 provides a faithful emulation of the VMX hardware any time there is a trap on VMX instructions, the guest running on L1 will not
See also,
- https://www.nakivo.com/blog/hyper-v-nested-virtualization-explained/
- http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
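With KVM on Intel hardware, whether nested VMX is offered to guests is exposed as a module parameter, normally readable at `/sys/module/kvm_intel/parameters/nested` (the value is `Y`/`N` on recent kernels and `1`/`0` on older ones). The helpers below are a sketch with made-up function names; the path is passed in so the parsing can be tried without KVM loaded.

```python
from pathlib import Path


def parse_nested_flag(raw: str) -> bool:
    """Interpret the contents of the kvm_intel 'nested' parameter file."""
    return raw.strip() in {"Y", "y", "1"}


def nested_vmx_enabled(param_path="/sys/module/kvm_intel/parameters/nested"):
    """Return True/False, or None if the parameter file cannot be read
    (for example when the kvm_intel module is not loaded)."""
    try:
        return parse_nested_flag(Path(param_path).read_text())
    except OSError:
        return None
```

A `None` result only says the file was unreadable; it does not distinguish an AMD host, a missing module, or a permissions problem.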
Side channel mitigations
- Disable Simultaneous Multithreading (SMT)
- Check Kernel Page-Table Isolation (KPTI) support
- Disable Kernel Same-page Merging (KSM)
- Check for speculative branch prediction issue mitigation
- Apply L1 Terminal Fault (L1TF) mitigation
- Apply Speculative Store Bypass Disable (SSBD) mitigation
- Use memory with Rowhammer mitigation support
- Disable swapping to disk or enable secure swap
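Several items on this checklist can be inspected through sysfs on a Linux host. The sketch below gathers a few of them into a dictionary; `mitigation_report` and `read_first_line` are hypothetical helper names, the set of files read is my own selection, and which files exist depends on the kernel version and configuration. It only reports status, it does not change anything.

```python
from pathlib import Path


def read_first_line(path):
    """Return the first line of a file, or None if missing or empty."""
    try:
        return Path(path).read_text().splitlines()[0].strip()
    except (OSError, IndexError):
        return None


def mitigation_report(sysfs_root="/sys"):
    """Collect SMT, KSM and CPU vulnerability status strings from sysfs."""
    root = Path(sysfs_root)
    vuln_dir = root / "devices/system/cpu/vulnerabilities"
    report = {
        # "on", "off", "forceoff" or "notsupported" on kernels that expose it
        "smt_control": read_first_line(root / "devices/system/cpu/smt/control"),
        # "0" means KSM is not running
        "ksm_run": read_first_line(root / "kernel/mm/ksm/run"),
    }
    # meltdown covers KPTI; spectre_v2 covers branch target injection
    for name in ("meltdown", "spectre_v2", "l1tf", "spec_store_bypass"):
        report[name] = read_first_line(vuln_dir / name)
    return report
```

Calling `mitigation_report()` on a host returns strings such as "Mitigation: ..." or "Vulnerable" as printed by the kernel; a `None` entry means the file was not present.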
Definitions
VMCS
Control fields
- Guest-state: Processor state saved into the guest state area on VM exits and loaded on VM entries
- Host-state: Processor state loaded from the host state area on VM exits
- VM-execution control: Fields controlling processor operation in VMX non-root operation
- VM-exit control: Fields that control VM exits
- VM-entry control: Fields that control VM entries
- VM-exit information: Read-only fields to receive information on VM exits describing the cause and the nature of the VM exit
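These groups show up in the numeric encoding of each VMCS field. As I recall the layout from the Intel SDM field-encoding appendix (worth verifying there): bit 0 is the access type, bits 9:1 an index, bits 11:10 the field type and bits 14:13 the width. The three control groups above share the single "control" type. The decoder and the example encodings below are a sketch; the constant names follow the ones used in the Linux headers.

```python
WIDTHS = ("16-bit", "64-bit", "32-bit", "natural-width")
TYPES = ("control", "VM-exit information", "guest-state", "host-state")

# A few encodings, quoted from memory of the Linux vmx.h header.
EXAMPLE_FIELDS = {
    "PIN_BASED_VM_EXEC_CONTROL": 0x4000,
    "VM_EXIT_REASON": 0x4402,
    "GUEST_RIP": 0x681E,
    "HOST_RIP": 0x6C16,
}


def decode_vmcs_field(encoding: int) -> dict:
    """Split a VMCS field encoding into its component bit fields."""
    return {
        "high_access": bool(encoding & 0x1),  # high half of a 64-bit field
        "index": (encoding >> 1) & 0x1FF,
        "type": TYPES[(encoding >> 10) & 0x3],
        "width": WIDTHS[(encoding >> 13) & 0x3],
    }
```

Decoding `EXAMPLE_FIELDS["VM_EXIT_REASON"]` gives a 32-bit field of the read-only VM-exit information type, while `GUEST_RIP` and `HOST_RIP` decode as natural-width fields in the guest-state and host-state areas.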
Acronyms
- GFN - Guest Frame Number
- HFN - Host Frame Number
- GVA - Guest Virtual Address
- GPA - Guest Physical Address
- SPA - System Physical Address
- EPT - Extended Page Table
- VPN - Virtual Page Number
- PFN - Page Frame Number
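The address acronyms fit together as a two-stage lookup: the guest page table maps a GVA to a GPA, and EPT maps that GPA to an SPA. The toy below assumes 4 KiB pages and uses flat dictionaries (VPN to GFN, then GFN to HFN) in place of real multi-level tables; on real hardware the guest page table walk itself also goes through EPT. All names here are made up for illustration.

```python
PAGE_SHIFT = 12                    # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(addr: int, table: dict, fault: str) -> int:
    """Translate one stage: replace the page number, keep the offset."""
    page, offset = addr >> PAGE_SHIFT, addr & PAGE_MASK
    if page not in table:
        raise LookupError(f"{fault} at {addr:#x}")
    return (table[page] << PAGE_SHIFT) | offset


def gva_to_spa(gva: int, guest_page_table: dict, ept: dict) -> int:
    """GVA -> GPA via the guest page table, then GPA -> SPA via EPT."""
    gpa = translate(gva, guest_page_table, "guest page fault")
    return translate(gpa, ept, "EPT violation")
```

With `guest_page_table = {0x10: 0x2}` and `ept = {0x2: 0x7F}`, the guest virtual address `0x10ABC` becomes guest physical `0x2ABC` and then system physical `0x7FABC`. A miss in the first table corresponds to a fault the guest kernel handles, while a miss in the second corresponds to an EPT violation handled by the VMM.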