Intel/AMD virtualization isolation and containment

Notes

This is the second part of a series. Read Part 1 - Process Isolation and Containment
Unless mentioned otherwise, I will be referring to the Intel architecture and Linux.

Virtual hardware

The key capability that enables cloud computing is the ability to separate computational activity from physical devices. This is generally referred to as virtualization.

At a high level, the Popek and Goldberg virtualization requirements can be stated as follows:

Virtualization constructs an isomorphism from guest to host by implementing the functions V() and E'().
All guest state S is mapped onto host state S' through a function V(S).
For every state change operation E(S) in the guest there is a corresponding state change E'(S') in the host.
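The two statements about V and E can be combined into a single condition. As a sketch using only the symbols above: running a guest operation and then mapping the result to the host must give the same host state as mapping first and then running the corresponding host operation.

```latex
\[
  V\bigl(E(S)\bigr) \;=\; E'\bigl(V(S)\bigr)
  \qquad \text{for every guest state } S, \text{ where } S' = V(S).
\]
```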

Popek and Goldberg (PG74)

In our case we want a host Intel x86 system S' to securely and efficiently hold the mapped state of a guest Intel x86 system S. One option would be to emulate the entire system in software, which satisfies the mapping but at a high performance cost. Some virtualization systems use a technique called para-virtualization, which in our case (Linux/Intel) often means running a modified guest kernel in Ring 1, trapping privileged instructions and using emulation to provide the expected control flow. Both of these approaches lack elements of security and efficiency.

Processor evolution

No privilege levels

All code runs with full privileges
No isolation of code or data

Figure: Our familiar CPU and memory model, sadly unable to efficiently fulfill the Popek and Goldberg requirements

Process virtualization

Instructions can be executed in different security contexts
Unprivileged code is isolated from other unprivileged code and data
Unprivileged code is isolated from privileged code and data
Privileged code has no restrictions and can perform hardware I/O

Figure: Our CPU and memory model with process virtualization, unable to virtualize privileged instructions

CPU virtualization

Everything from the process virtualization model above is still present
Hardware support for CPU virtualization has been added on top of it

Figure: Our familiar CPU and memory model with process virtualization and VMX, now able to fully virtualize a CPU in hardware and efficiently fulfill the Popek and Goldberg requirements

To support efficient, hardware-based virtualization, Intel and AMD introduced separate but functionally similar hardware extensions in 2005 and 2006. These formed the foundation that both companies built on in the following years. Intel’s system is called VT-x and its instruction set extension is VMX (Virtual Machine Extensions); AMD’s system is called AMD-V and its extension is SVM (Secure Virtual Machine).
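On Linux, the `flags` line of `/proc/cpuinfo` lists `vmx` on processors with Intel VT-x and `svm` on processors with AMD-V. The following is a minimal sketch of checking for them; the function name `virtualization_support` is my own and not part of any library.

```python
def virtualization_support(cpuinfo_text):
    """Return 'VT-x', 'AMD-V' or None based on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags:
                return "VT-x"
            if "svm" in flags:
                return "AMD-V"
            return None
    return None  # no flags line found (e.g. not an x86 Linux system)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(virtualization_support(f.read()))
    except OSError:
        print("could not read /proc/cpuinfo")
```

Note that a flag can be present in the CPU but disabled in firmware, and a guest will only see it if the host exposes it, so this check is a first hint rather than proof that hardware virtualization is usable.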

Intel CPU Virtualization

As part of the VMX extensions, a new privilege system was introduced to determine access to the VMX instructions.

VMX-capable processors can run in either root or non-root mode.
- Root mode can access the VMX instructions and can run hardware-based virtual machines.
- Non-root mode cannot access the VMX instructions.
These modes are orthogonal to the existing ring-based privilege level system.

	root	non-root
Ring 0	host kernel	guest kernel
Ring 3	host process	guest process

Figure: VMM and VM interaction cycle

The VMM (Hypervisor) controls all entries to and exits from the VM

Figure: Flow of control - Relationship between components in a multi-tenant virtualized environment

Figure: Flow of control - Windows and Linux running in virtual machines with virtual hardware provided by QEMU and KVM as the VMM/Hypervisor

A Hypervisor, Control Program or Virtual Machine Monitor (VMM) is computer software, firmware or hardware that creates and runs virtual machines. A computer on which a hypervisor runs one or more virtual machines is called a host machine, and each virtual machine is called a guest machine.

A multi-process kernel creates multiple processes and arranges their memory and execution so that they cannot interfere with each other.
A VMM (Virtual Machine Monitor) creates multiple virtual machines to run software and arranges their memory and execution so that they cannot interfere with each other.

Figure: IaaS Shared Tenancy

Figure: PaaS Shared Tenancy

VM exit/entry

Instructions such as CPUID and MOV from/to CR3 are intercepted and cause a VM exit
Exceptions/faults such as page faults are intercepted as VM exits, and virtualized exceptions/faults are injected into guests on VM entry
External interrupts unrelated to guests are intercepted as VM exits, and virtualized interrupts are injected into guests on VM entry

VMEXIT reasons

Category	Description
Exception	Any guest instruction that causes an exception
Interrupt	An external I/O interrupt
Root-mode sensitive	x86 privileged or sensitive instructions (e.g. `hlt`, `pause`)
Hypercall	`vmcall` - Explicit transition from non-root to root
VT-x new	ISA extensions to control non-root execution (e.g. `vmclear`, `vmlaunch`)

Other reasons: triple fault (failure), legacy emulation, interrupt window, legacy I/O instructions, EPT violations.
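To make the exit/entry cycle concrete, here is a toy model of a VMM loop that receives VM exits and decides what to do before re-entering the guest. It is a sketch only: the event names (`"exception"`, `"sensitive"` and so on) loosely follow the table above, `run_guest` is a hypothetical function, and no real VMX instructions are involved.

```python
from collections import deque


def run_guest(exit_events):
    """Toy VMM loop: handle each VM exit in root mode, then re-enter the guest."""
    pending = deque(exit_events)
    log = []
    while pending:
        # A VM exit hands control (and a reason) to the VMM.
        reason, detail = pending.popleft()
        if reason == "exception":
            log.append(f"inject virtual exception {detail} on next VM entry")
        elif reason == "interrupt":
            log.append(f"host handles external interrupt {detail}")
        elif reason == "sensitive":
            log.append(f"emulate {detail}")
        elif reason == "hypercall":
            log.append(f"service hypercall {detail}")
        elif reason == "triple_fault":
            log.append("stop guest")
            break  # failure: the guest is not resumed
        else:
            log.append(f"unhandled exit {reason}")
        # Reaching the next loop iteration models the VM entry (resume).
    return log


if __name__ == "__main__":
    for entry in run_guest([("sensitive", "hlt"), ("hypercall", "vmcall")]):
        print(entry)
```

The point of the sketch is the shape of the control flow: the guest never handles these events itself, every one of them passes through the VMM first.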

Figure: Compute, IO, Memory

VMEXIT security controls

Nested Virtual Machines

The Intel x86 architecture with VMX is a single-level virtualization architecture. This means that only a single VMM can use the processor’s VMX extensions to run guests. Running a hypervisor inside a guest therefore requires the host VMM to emulate VMX for it.

The “Nested VMX” feature adds this missing capability - of running guest hypervisors (which use VMX) with their own nested guest. It does so by allowing a guest to use VMX instructions, and correctly and efficiently emulating them using the single level of VMX available in the hardware.

https://www.kernel.org/doc/html/latest/virt/kvm/nested-vmx.html
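On a Linux host running KVM, whether nested virtualization is enabled is exposed as the `nested` parameter of the `kvm_intel` (or `kvm_amd`) module under `/sys/module`. The following is a small sketch for reading it; the helper names are my own, and I am assuming the value is reported as `Y`/`N` or `1`/`0` depending on kernel version.

```python
from pathlib import Path


def nested_enabled(raw):
    """Interpret the contents of a KVM 'nested' module parameter file."""
    return raw.strip() in ("Y", "1")


def read_nested(module="kvm_intel"):
    """Return True/False, or None if the module parameter is not present."""
    path = Path("/sys/module") / module / "parameters" / "nested"
    try:
        return nested_enabled(path.read_text())
    except OSError:
        return None  # module not loaded, or not a Linux/KVM host


if __name__ == "__main__":
    print("kvm_intel nested:", read_nested("kvm_intel"))
    print("kvm_amd nested:", read_nested("kvm_amd"))
```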

Since the Intel x86 architecture is a single-level virtualization architecture, only a single hypervisor can use the processor’s VMX instructions to run its guests. For unmodified guest hypervisors to use VMX instructions, this single bare-metal hypervisor, which we call L0, needs to emulate VMX. This emulation of VMX can work recursively. Given that L0 provides a faithful emulation of the VMX hardware any time there is a trap on VMX instructions, the guest running on L1 will not

Nested virtualization

Definitions

VMCS

Control fields

Guest-state - Processor state saved into the guest-state area on VM exits and loaded on VM entries
Host-state - Processor state loaded from the host-state area on VM exits
VM-execution control - Fields controlling processor operation in VMX non-root operation
VM-exit control - Fields that control VM exits
VM-entry control - Fields that control VM entries
VM-exit information - Read-only fields that receive information on VM exits, describing the cause and the nature of the VM exit
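The roles of the guest-state, host-state and VM-exit information areas can be illustrated with a toy model. This is only a sketch: `ToyVMCS`, `vm_entry` and `vm_exit` are hypothetical names, the CPU is a plain dictionary of registers, and the real VMCS layout and the control fields are not modelled.

```python
class ToyVMCS:
    """Toy stand-in for three of the VMCS areas described above."""

    def __init__(self, guest_state, host_state):
        self.guest_state = dict(guest_state)  # saved on exit, loaded on entry
        self.host_state = dict(host_state)    # loaded on exit
        self.exit_info = {}                   # written on exit, read by the VMM


def vm_entry(cpu, vmcs):
    """Load guest state into the (toy) CPU registers."""
    cpu.clear()
    cpu.update(vmcs.guest_state)


def vm_exit(cpu, vmcs, reason):
    """Save guest state, record why we left, then load host state."""
    vmcs.guest_state = dict(cpu)
    vmcs.exit_info = {"reason": reason}
    cpu.clear()
    cpu.update(vmcs.host_state)


if __name__ == "__main__":
    vmcs = ToyVMCS(guest_state={"rip": 0x1000}, host_state={"rip": 0xFFFF0000})
    cpu = {}
    vm_entry(cpu, vmcs)
    cpu["rip"] += 2              # the guest runs for a while
    vm_exit(cpu, vmcs, "cpuid")  # a sensitive instruction forces an exit
    print(vmcs.guest_state, vmcs.exit_info, cpu)
```

Note the asymmetry the model preserves: guest state is both saved and loaded, while host state is only loaded on exit, so the VMM has to set it up beforehand.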

Acronyms

GFN - Guest Frame Number
HFN - Host Frame Number
GVA - Guest Virtual Address
GPA - Guest Physical Address
SPA - System Physical Address
EPT - Extended Page Table
VPN - Virtual Page Number
PFN - Page Frame Number
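To show how these terms fit together, here is a toy two-stage address translation: the guest's own page table maps a GVA to a GPA, and the EPT, which is controlled by the VMM, maps that GPA to an SPA. This is a sketch under simplifying assumptions: single-level tables held in dictionaries, 4 KiB pages, and hypothetical function names.

```python
PAGE_SHIFT = 12                      # assume 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(addr, table):
    """Map an address through a {page number: frame number} table."""
    frame = table[addr >> PAGE_SHIFT]          # a KeyError models a fault
    return (frame << PAGE_SHIFT) | (addr & PAGE_MASK)


def gva_to_spa(gva, guest_page_table, ept):
    """Two-stage translation: GVA -> GPA -> SPA."""
    gpa = translate(gva, guest_page_table)     # VPN -> GFN, guest-controlled
    return translate(gpa, ept)                 # GFN -> HFN, VMM-controlled


if __name__ == "__main__":
    guest_page_table = {0x10: 0x2}   # guest virtual page 0x10 -> guest frame 0x2
    ept = {0x2: 0x7F}                # guest frame 0x2 -> host frame 0x7F
    print(hex(gva_to_spa(0x10ABC, guest_page_table, ept)))
```

A missing entry in the second table corresponds to the EPT violation listed among the VM exit reasons: the guest cannot reach host memory that the VMM has not mapped for it.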

Read More