Michiel Kalkman

Notes and observations

19 Jul 2020

Intel/AMD virtualization isolation and containment

Notes

Virtual hardware

The key capability that enables cloud computing is the ability to separate computational activity from physical devices. This is generally referred to as virtualization.

The Popek and Goldberg virtualization requirements can be summarized at a high level as follows:

  • Virtualization constructs an isomorphism from guest to host by implementing the functions V() and E’()
  • All guest state S is mapped onto host state S’ through a function V(S)
  • For every state change operation E(S) in the guest there is a corresponding state change E’(S’) in the host
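The isomorphism requirement can be illustrated with a toy model. This is only a sketch: the register names, the "vm0" prefix and the choice of a program-counter increment as the operation are assumptions made for illustration, not part of the Popek and Goldberg formulation.

```python
# Toy model: guest state is a dict of registers; the host keeps the same
# registers under a per-VM prefix. All names here are illustrative.

def V(guest_state: dict, vm_id: str = "vm0") -> dict:
    """Map guest state S onto host state S'."""
    return {f"{vm_id}.{reg}": value for reg, value in guest_state.items()}

def E(guest_state: dict) -> dict:
    """A guest state change: advance the program counter by one."""
    new_state = dict(guest_state)
    new_state["pc"] += 1
    return new_state

def E_prime(host_state: dict, vm_id: str = "vm0") -> dict:
    """The corresponding host state change on the mapped state."""
    new_state = dict(host_state)
    new_state[f"{vm_id}.pc"] += 1
    return new_state

S = {"pc": 100, "acc": 7}

# Mapping then stepping must equal stepping then mapping.
commutes = V(E(S)) == E_prime(V(S))
print(commutes)
```

If the two paths ever disagreed, the guest could observe behaviour that differs from the machine it believes it is running on.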

In our case we are looking for a host Intel x86 system S' onto which the state of a guest Intel x86 system S can be mapped securely and efficiently. One option is to emulate the entire system in software, which meets all the requirements but is slow. Another is para-virtualization, in which the guest kernel is modified to cooperate with the hypervisor; in our case (Linux/Intel) this has often been combined with running the guest kernel deprivileged in Ring 1, trapping privileged instructions and using emulation to provide the expected control flow. Both approaches fall short on security, efficiency, or both: full emulation is slow, and deprivileging is fragile on classic x86 because some sensitive instructions do not trap when executed outside Ring 0.

Processor evolution

No privilege levels

  • All code runs with full privileges
  • No isolation of code or data

Process virtualization

  • Instructions can be executed in different security contexts
  • Unprivileged code is isolated from other unprivileged code and data
  • Unprivileged code is isolated from privileged code and data
  • Privileged code has no restrictions and can perform hardware I/O

CPU virtualization

  • The evolved system with process virtualization is completely present
  • Hardware support for CPU virtualization has been added

To support efficient, hardware-based virtualization, Intel and AMD introduced separate but functionally similar hardware extensions in 2005 and 2006. These formed the foundation that both companies built on in the following years. Intel’s system is called VT-x, and its instruction set extension is called Virtual Machine Extensions (VMX). AMD’s system is called AMD-V, and its instruction set extension is called Secure Virtual Machine (SVM).
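On Linux, whether a processor advertises one of these extensions can be read from the "flags" line of /proc/cpuinfo, where VT-x appears as "vmx" and AMD-V as "svm". The following is a minimal sketch; the function name and the sample text are made up for illustration.

```python
def virtualization_flags(cpuinfo_text: str) -> set:
    """Return the hardware virtualization flags found in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # "vmx" is Intel VT-x, "svm" is AMD-V
            found |= {"vmx", "svm"} & set(value.split())
    return found

# Illustrative sample; on a real system pass open("/proc/cpuinfo").read()
SAMPLE = "processor\t: 0\nflags\t\t: fpu vme de pse msr vmx ept\n"
print(virtualization_flags(SAMPLE))
```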

Intel CPU Virtualization

As part of the VMX extensions, a new privilege system was introduced to determine access to the VMX instructions.

  • VMX capable processors can run in either root or non-root mode.
    • The root mode can access the VMX instructions and can run hardware-based virtual machines.
    • The non-root mode cannot access the VMX instructions
  • These modes are orthogonal to the existing ring-based privilege level system.
                 root            non-root
        Ring 0   host kernel     guest kernel
        Ring 3   host process    guest process
  • The VMM (Hypervisor) controls all entries to and exits from the VM
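The two privilege dimensions can be modelled as a pair of (mode, ring). The sketch below is a simplification under stated assumptions: it only covers Ring 0 and Ring 3, and it reduces the outcome of executing a VMX instruction to three cases (it runs, it faults, or it causes a VM exit). The names are illustrative.

```python
from enum import Enum

class Mode(Enum):
    ROOT = "root"
    NON_ROOT = "non-root"

# The four combinations from the table above.
ROLE = {
    (Mode.ROOT, 0): "host kernel",
    (Mode.NON_ROOT, 0): "guest kernel",
    (Mode.ROOT, 3): "host process",
    (Mode.NON_ROOT, 3): "guest process",
}

def vmx_instruction_outcome(mode: Mode, ring: int) -> str:
    """Simplified outcome of executing a VMX instruction such as vmlaunch."""
    if mode is Mode.NON_ROOT:
        return "VM exit"      # control passes to the VMM, whatever the ring
    if ring == 0:
        return "executes"     # only privileged root-mode code may use VMX
    return "fault"            # root mode, but not privileged

for (mode, ring), role in ROLE.items():
    print(f"{role:13} {mode.value:9} ring {ring}: {vmx_instruction_outcome(mode, ring)}")
```

A guest kernel in Ring 0 is still in non-root mode, so ring privilege alone never grants access to the VMX instructions.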

A Hypervisor, Control Program or Virtual Machine Monitor (VMM) is computer software, firmware or hardware that creates and runs virtual machines. A computer on which a hypervisor runs one or more virtual machines is called a host machine, and each virtual machine is called a guest machine.

  • A multi-process kernel creates multiple processes and arranges their memory and execution so that they cannot interfere with each other.
  • A VMM (Virtual Machine Monitor) creates multiple virtual machines to run software and arranges their memory and execution so that they cannot interfere with each other.

VM exit/entry

  • Instructions such as CPUID and MOV to/from CR3 are intercepted and cause a VMEXIT
  • Exceptions/faults such as page faults are intercepted as VMEXIT, and virtualized exceptions/faults are injected into guests on VMENTRY
  • External interrupts unrelated to guests are intercepted as VMEXIT, and virtualized interrupts are injected into guests on VMENTRY

VMEXIT reasons

Category              Description
Exception             Any guest instruction that causes an exception
Interrupt             An external I/O interrupt
Root-mode sensitive   x86 privileged or sensitive instructions (e.g. hlt, pause)
Hypercall             vmcall: explicit transition from non-root to root
VT-x new              ISA extensions to control non-root execution (e.g. vmclear, vmlaunch)

Other reasons: triple fault (failure), legacy emulation, interrupt window, legacy I/O instructions, EPT violations.
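A VMM typically handles exits in a loop: run the guest, inspect the exit reason, act on it, and re-enter. The following toy dispatcher sketches that shape. The numeric values follow the basic exit reason numbering in the Intel SDM as I recall it, but the vCPU dictionary, the handler behaviour and the replayed trace are assumptions made purely for illustration; a real VMM reads the reason and the instruction length from the VMCS.

```python
from enum import IntEnum

class ExitReason(IntEnum):
    EXCEPTION_OR_NMI = 0
    EXTERNAL_INTERRUPT = 1
    TRIPLE_FAULT = 2
    CPUID = 10
    HLT = 12
    VMCALL = 18
    EPT_VIOLATION = 48

def handle_exit(reason: ExitReason, vcpu: dict, instr_len: int = 0) -> str:
    """Return 'resume' to re-enter the guest or 'stop' to tear it down."""
    if reason == ExitReason.TRIPLE_FAULT:
        return "stop"                    # unrecoverable guest failure
    if reason == ExitReason.CPUID:
        vcpu["rax"] = 0                  # placeholder for a filtered CPUID result
        vcpu["rip"] += instr_len         # skip the emulated instruction
    elif reason == ExitReason.VMCALL:
        vcpu["hypercalls"] += 1          # explicit request from the guest
        vcpu["rip"] += instr_len
    elif reason == ExitReason.HLT:
        vcpu["halted"] = True
        vcpu["rip"] += instr_len
    elif reason == ExitReason.EXTERNAL_INTERRUPT:
        vcpu["host_interrupts"] += 1     # the host services it; guest rip unchanged
    return "resume"

def run(vcpu: dict, exits: list) -> int:
    """Replay a list of (reason, instruction length) exits; return how many were seen."""
    seen = 0
    for reason, instr_len in exits:
        seen += 1
        if handle_exit(reason, vcpu, instr_len) == "stop":
            break
    return seen

vcpu = {"rip": 0x1000, "rax": 1, "hypercalls": 0, "halted": False, "host_interrupts": 0}
trace = [
    (ExitReason.CPUID, 2),
    (ExitReason.VMCALL, 3),
    (ExitReason.EXTERNAL_INTERRUPT, 0),
    (ExitReason.TRIPLE_FAULT, 0),
    (ExitReason.HLT, 1),                 # never reached: the guest was stopped
]
seen = run(vcpu, trace)
print(seen, hex(vcpu["rip"]))
```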

VMEXIT security controls

Nested Virtual Machines

The Intel x86 architecture with VMX is a single-level virtualization architecture. This means that only a single VMM can use the processor’s VMX extensions to run guests. Running a hypervisor inside a guest therefore requires the host VMM to emulate VMX.

The “Nested VMX” feature adds this missing capability: running guest hypervisors (which use VMX) with their own nested guests. It does so by allowing a guest to use VMX instructions, and by correctly and efficiently emulating them using the single level of VMX available in the hardware.

Since the Intel x86 architecture is a single-level virtualization architecture, only a single hypervisor can use the processor’s VMX instructions to run its guests. For unmodified guest hypervisors to use VMX instructions, this single bare-metal hypervisor, which we call L0, needs to emulate VMX. This emulation of VMX can work recursively. Given that L0 provides a faithful emulation of the VMX hardware any time there is a trap on VMX instructions, the guest running on L1 will not
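Because the hardware has a single level of VMX, every exit lands in L0 first, and L0 decides what to do with it. The sketch below models only that routing decision, using L2 to mean a guest of the L1 hypervisor. The instruction set, the string labels and the function name are illustrative assumptions, not an actual hypervisor interface.

```python
# VMX instructions that a guest hypervisor (L1) may try to execute.
VMX_INSTRUCTIONS = {"vmxon", "vmclear", "vmptrld", "vmread", "vmwrite",
                    "vmlaunch", "vmresume"}

def route_exit(source: str, reason: str, l1_intercepts: set) -> str:
    """Decide how L0 treats an exit; the hardware always delivers it to L0."""
    if source == "L1":
        if reason in VMX_INSTRUCTIONS:
            return "L0 emulates"         # L1 believes it used real VMX hardware
        return "L0 handles"              # ordinary exit from an ordinary guest
    if source == "L2":
        if reason in l1_intercepts:
            return "L0 reflects to L1"   # L1 asked to see this kind of exit
        return "L0 handles"              # L1 never learns the exit happened
    raise ValueError(f"unknown source: {source}")

l1_intercepts = {"cpuid", "hlt"}
print(route_exit("L1", "vmlaunch", l1_intercepts))
print(route_exit("L2", "cpuid", l1_intercepts))
print(route_exit("L2", "external_interrupt", l1_intercepts))
```

Reflecting an exit to L1 is itself emulation: L0 fills in the exit information L1 expects and resumes L1 as if the hardware had delivered the exit directly.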


Side channel mitigations

  • Disable Simultaneous Multithreading (SMT)
  • Check Kernel Page-Table Isolation (KPTI) support
  • Disable Kernel Same-page Merging (KSM)
  • Check for speculative branch prediction issue mitigation
  • Apply L1 Terminal Fault (L1TF) mitigation
  • Apply Speculative Store Bypass (SSBD) mitigation
  • Use memory with Rowhammer mitigation support
  • Disable swapping to disk or enable secure swap
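On a Linux host, several of these items can be checked through sysfs. The sketch below reads the SMT control file, the KSM run flag and a few entries under the kernel's vulnerabilities directory. It only reports what the kernel says and changes nothing; the function name and the labels in CHECKS are illustrative, and files missing on a given kernel are reported as "unavailable".

```python
from pathlib import Path

# Label -> path relative to the sysfs root (normally /sys).
CHECKS = {
    "smt": "devices/system/cpu/smt/control",
    "ksm": "kernel/mm/ksm/run",
    "meltdown (KPTI)": "devices/system/cpu/vulnerabilities/meltdown",
    "spectre_v2": "devices/system/cpu/vulnerabilities/spectre_v2",
    "l1tf": "devices/system/cpu/vulnerabilities/l1tf",
    "spec_store_bypass": "devices/system/cpu/vulnerabilities/spec_store_bypass",
}

def read_status(sysfs_root: str = "/sys") -> dict:
    """Return the raw contents of each status file, or 'unavailable'."""
    status = {}
    for label, relative in CHECKS.items():
        try:
            status[label] = (Path(sysfs_root) / relative).read_text().strip()
        except OSError:
            status[label] = "unavailable"
    return status

for label, value in read_status().items():
    print(f"{label}: {value}")
```

For KSM, a value of 0 means the merging daemon is not running. For SMT, the control file reports values such as "on" or "off".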

Definitions

VMCS

Field groups

The Virtual Machine Control Structure (VMCS) is organized into the following groups of fields.

  • Guest-state area: processor state saved into the guest-state area on VM exits and loaded on VM entries
  • Host-state area: processor state loaded from the host-state area on VM exits
  • VM-execution control fields: control processor operation in VMX non-root operation
  • VM-exit control fields: control VM exits
  • VM-entry control fields: control VM entries
  • VM-exit information fields: read-only fields that receive information on VM exits, describing the cause and nature of the VM exit
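How these areas are used across a VM entry and a VM exit can be sketched with a small model. This is an illustration under stated assumptions: the CPU is reduced to a dictionary of registers, the controls are plain sets, and the class and function names are made up rather than taken from any real VMM.

```python
from dataclasses import dataclass, field

@dataclass
class VMCS:
    guest_state: dict = field(default_factory=dict)
    host_state: dict = field(default_factory=dict)
    execution_controls: set = field(default_factory=set)
    exit_controls: set = field(default_factory=set)
    entry_controls: set = field(default_factory=set)
    exit_info: dict = field(default_factory=dict)

def vm_entry(cpu: dict, vmcs: VMCS) -> None:
    """Load guest state from the guest-state area into the CPU."""
    cpu.clear()
    cpu.update(vmcs.guest_state)

def vm_exit(cpu: dict, vmcs: VMCS, reason: str) -> None:
    """Save guest state, record why the exit happened, load host state."""
    vmcs.guest_state = dict(cpu)
    vmcs.exit_info = {"reason": reason}
    cpu.clear()
    cpu.update(vmcs.host_state)

vmcs = VMCS(guest_state={"rip": 0x2000}, host_state={"rip": 0xFFFF0000})
cpu = {"rip": 0xFFFF0000}          # the VMM is running

vm_entry(cpu, vmcs)                # the CPU now runs the guest
cpu["rip"] += 2                    # the guest makes some progress
vm_exit(cpu, vmcs, "cpuid")        # back in the VMM

print(hex(cpu["rip"]), hex(vmcs.guest_state["rip"]), vmcs.exit_info)
```

Note that host state is only ever loaded from the VMCS in this model, matching the description above: the VMM sets up the host-state area itself before entering the guest.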

Acronyms

  • GFN - Guest Frame Number
  • HFN - Host Frame Number
  • GVA - Guest Virtual Address
  • GPA - Guest Physical Address
  • SPA - System Physical Address
  • EPT - Extended Page Table
  • VPN - Virtual Page Number
  • PFN - Page Frame Number
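Several of these terms fit together in the two-stage address translation used with EPT: the guest's page tables map a GVA to a GPA, and the EPT maps that GPA to an SPA. The sketch below assumes 4 KiB pages and flattens each stage into a single dictionary lookup, which real hardware does not do (it walks multi-level tables). The table contents are made-up values.

```python
PAGE_SHIFT = 12                          # 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def translate(address: int, table: dict) -> int:
    """Swap the page number using the table and keep the page offset."""
    frame = table[address >> PAGE_SHIFT]   # a KeyError stands in for a fault
    return (frame << PAGE_SHIFT) | (address & PAGE_MASK)

guest_page_table = {0x400: 0x10}         # VPN -> GFN, maintained by the guest
ept = {0x10: 0x9A}                       # GFN -> HFN, maintained by the VMM

gva = 0x400ABC
gpa = translate(gva, guest_page_table)
spa = translate(gpa, ept)
print(hex(gva), hex(gpa), hex(spa))
```

A missing entry in the second table corresponds to the EPT violations mentioned among the VMEXIT reasons: the guest's own mapping is valid, but the VMM has not backed that guest frame.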

Further reading

Next time, we'll talk about "What Tiger King can teach us about x86 Assembly"