Intel/Linux process isolation and containment

The Linux process model

We’ll take a traditional multi-user Linux environment as a starting point. In this scenario, the provider runs the hardware, the operating system and supporting system functions.

Figure: How it is presented - a multi-user system with two tenancy peers

It turns out that, on Linux, processes cannot actually do much beyond computing on data in their own memory space. To do anything outside of that, a process has to request that the kernel perform an operation on its behalf. This includes operations such as,

  • Starting a new process
  • Listing the active processes on the system
  • Reading data from a file
  • Listening on a network port
  • Opening a connection to a remote network port
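As a rough illustration of the list above, the sketch below uses Python's os module, whose functions are thin wrappers around the corresponding system calls on Linux. The helper names read_first_bytes and list_pids are made up for this example, and list_pids assumes that procfs is mounted at /proc.

```python
import os


def read_first_bytes(path, count=64):
    """Read up to `count` bytes from `path`; each step is a request to the kernel."""
    fd = os.open(path, os.O_RDONLY)   # open(2) / openat(2)
    try:
        return os.read(fd, count)     # read(2)
    finally:
        os.close(fd)                  # close(2)


def list_pids(proc_root="/proc"):
    """List process IDs from the numeric directory names the kernel exposes in procfs."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())
```

Calling read_first_bytes("/proc/self/status") or list_pids() on a Linux system never touches a disk or another process directly; the process only asks the kernel and receives the answer.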

There is no direct communication between processes. As we are interested in the lines of communication, our depiction of the relationship between processes and kernel takes that into account.

Figure: How it works - processes only communicate with the kernel

The Linux kernel leverages constructs in the CPU to ensure that processes cannot perform these actions directly. The CPU that the system is running on is able to run instructions at different privilege levels. For Intel-based processors these are generally referred to as rings. The Linux kernel itself runs in Ring 0, which is a privileged domain.

A note on terminology : Linux documentation does not usually refer to CPU privilege levels directly. It refers instead to User space, which is Ring 3 (unprivileged), and Kernel space, which is Ring 0 (privileged). Processes run in user space and the kernel runs in kernel space.

Figure: Intel privilege rings and process memory visibility

For example, some operations that require Ring 0 privileges are,

  • Run all instructions
  • Write to all registers
  • Modify current segment register
  • Modify page tables (CR3 register)
  • Register interrupt handlers
  • Use IO instructions

These instructions and capabilities give the kernel full control over the system.

If we try to execute any of these as a process (that is, while running in User space / Ring 3), the CPU will trap our code (halt execution) and hand over control to the kernel to decide what needs to be done. Usually the kernel will terminate the process, log the event and optionally dump the whole memory image of the offending process to disk.

That is one of the things that may have happened when you see this message come by,

Segmentation fault (core dumped)
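The trap-and-terminate behaviour can be observed from a parent process. The sketch below is an assumption-laden stand-in: Python cannot easily emit a privileged instruction, so the child triggers a different fault (reading an unmapped address), which the kernel also answers with SIGSEGV. The function name is hypothetical.

```python
import ctypes
import os
import resource
import signal


def signal_that_killed_child():
    """Fork a child that reads an unmapped address; return the signal that killed it."""
    pid = os.fork()
    if pid == 0:
        # Child: disable core dumps, then read a C string from address 0,
        # which is not mapped. The CPU faults and the kernel takes over.
        resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
        ctypes.string_at(0)
        os._exit(0)  # only reached if the read unexpectedly succeeded
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
```

On Linux the returned value is expected to equal signal.SIGSEGV, which is the signal a shell reports as "Segmentation fault".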

Our processes still need to open files and so on, though. We can ask the kernel to perform these operations on our behalf, but there are a few steps to doing this.

The mechanism for calling the kernel is to trigger a privilege change - i.e. code executing in Ring 3 has to inform the CPU that it wants to hand over control to code in Ring 0. This is generally done by executing a special instruction, and the resulting request to the kernel is called a system call.

With this knowledge we can re-draw the relationship of processes.

Figure: How it works - dotted lines are gated, full lines are ungated

Note on privilege changes

Interrupts can also trigger a privilege change. An interrupt is an event that is generated (or raised) by software or hardware. A hardware interrupt can signal, for example, the arrival of a packet on a network interface. A software interrupt is raised by executing the int instruction.

  • Interrupt handler 0x80 (128) is the Linux legacy syscall handler. Raising the interrupt transfers control to a ring 0 interrupt handler. The kernel returns with the iret instruction
  • 32-bit mode fast system calls use sysenter to call into ring 0 and sysexit to return to ring 3
  • 64-bit mode fast system calls use syscall to call into ring 0 and sysret to return to ring 3

Collectively we will refer to these as the System Call Gate.
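To make the gate tangible, the following minimal sketch requests getpid by number through the C library's generic syscall() wrapper, which executes the architecture's system call instruction. It assumes a 64-bit Linux process. The small number table covers only two architectures and is included for illustration, and raw_getpid is a hypothetical helper name.

```python
import ctypes
import os
import platform
import sys

# Syscall numbers differ per architecture; this is an illustrative subset only.
GETPID_NR = {"x86_64": 39, "aarch64": 172}


def raw_getpid():
    """Invoke getpid by number via libc's syscall(); return None where unsupported."""
    nr = GETPID_NR.get(platform.machine())
    is_64bit = ctypes.sizeof(ctypes.c_void_p) == 8
    if not sys.platform.startswith("linux") or nr is None or not is_64bit:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(nr))
```

Where the helper is supported, its result should match os.getpid(), since both paths end at the same kernel entry point.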

Policy Controls on the Linux System Call Gate

So far we know that,

  1. Processes cannot communicate with other processes directly
  2. Processes cannot access resources directly
  3. Performing either of these can only be done with privileged instructions
  4. We can only access privileged instructions indirectly via system calls

That makes the System Call Gate our primary attack surface.

Figure: The Ring 3 to Ring 0 attack surface on Linux

We can find out the area of that surface by running the command man 2 syscalls. It lists over 400 documented system calls. (We can always refer to the kernel source to find out how accurate this list is).

SYSCALLS(2)                Linux Programmer's Manual               SYSCALLS(2)

NAME
       syscalls - Linux system calls

DESCRIPTION
       The system call is the fundamental interface between an application and
       the Linux kernel.
       [...]

       System call                Kernel        Notes
       ─────────────────────────────────────────────────────────────────────
       accept(2)                  2.0           See notes on socketcall(2)
       accept4(2)                 2.6.28
       access(2)                  1.0
       acct(2)                    1.0
       add_key(2)                 2.6.10
       [...]
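As a cross-check on the manual page, one option is to count the syscall numbers defined in the kernel's userspace headers. The sketch below only parses header text. Where that header lives is an assumption that varies by distribution and architecture (for example an asm/unistd_64.h file under /usr/include on x86-64 systems with kernel headers installed), and the function name is hypothetical.

```python
import re

# Matches lines such as "#define __NR_read 0".
_NR_DEFINE = re.compile(
    r"^[ \t]*#[ \t]*define[ \t]+__NR_(\w+)[ \t]+(\d+)[ \t]*$", re.MULTILINE
)


def parse_syscall_numbers(header_text):
    """Return {syscall_name: number} for every __NR_* define found in the text."""
    return {name: int(number) for name, number in _NR_DEFINE.findall(header_text)}
```

Passing the contents of the header to parse_syscall_numbers and taking len() of the result gives a count for one architecture. Some headers also define bookkeeping macros with the same prefix, so the figure is an approximation.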

Preventive Control : Permissions, ACLs and Capabilities

Figure: Ownership is resource-based and checked after the function is invoked

This control is only applied once the system call function is already executing. It is also resource-based, which implies that the resource itself, or its metadata, has to be accessed in order for the control to be evaluated. In addition, Capabilities can be set on processes that allow this control to be bypassed entirely. The capability implementation itself also varies across capabilities and resources, increasing the likelihood of both bugs and evaluation errors.

The combination of those factors makes this a relatively weak control.
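Whether a given process can bypass ownership checks can be read from its effective capability mask. The sketch below decodes the CapEff line of /proc/<pid>/status. The capability bit numbers are taken from the kernel's capability definitions, and the helper names are hypothetical.

```python
CAP_DAC_OVERRIDE = 1   # bypass file read, write and execute permission checks
CAP_SYS_PTRACE = 19    # trace arbitrary processes
CAP_SYS_ADMIN = 21     # broad set of administrative operations


def effective_capabilities(status_text):
    """Return the CapEff bitmask from the text of /proc/<pid>/status (0 if absent)."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    return 0


def has_capability(mask, cap):
    """True if capability number `cap` is set in the bitmask."""
    return bool(mask & (1 << cap))
```

For example, has_capability(effective_capabilities(open("/proc/self/status").read()), CAP_DAC_OVERRIDE) reports whether file permission checks would be skipped for the current process.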

Figure: Ownership event tree (permissions and ACLs)

Preventive Control : Secure Computing Mode (SECCOMP)

Figure: SECCOMP can be used as a system call firewall, preventing function invocation

Notes,

  • SECCOMP has two modes - 1 (strict) or 2 (filter)
    • Strict mode only allows a process to call read(2), write(2), _exit(2) and sigreturn(2)
    • Filter mode allows or denies system calls according to a BPF program that the process hands to the kernel by pointer
  • A firewall for syscalls
  • Any reduction in the number of available system calls reduces the attack surface
  • Filters function invocation without resource context
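A minimal sketch of filter mode follows. It builds a four-instruction classic BPF program that returns an error for one system call number and allows everything else, then installs it with prctl(2) through ctypes. The helper names are hypothetical. The constants come from the kernel's seccomp and BPF headers. For brevity the filter does not check the architecture field of seccomp_data, which a production filter should do first.

```python
import ctypes
import errno
import struct

PR_SET_SECCOMP = 22
PR_SET_NO_NEW_PRIVS = 38
SECCOMP_MODE_FILTER = 2
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
BPF_LD_W_ABS = 0x20    # BPF_LD | BPF_W | BPF_ABS
BPF_JMP_JEQ_K = 0x15   # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06       # BPF_RET | BPF_K


def _insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)


def build_deny_filter(syscall_nr, err=errno.EPERM):
    """BPF program: fail `syscall_nr` with `err`, allow every other system call."""
    return b"".join([
        _insn(BPF_LD_W_ABS, 0, 0, 0),                   # load seccomp_data.nr
        _insn(BPF_JMP_JEQ_K, 0, 1, syscall_nr),         # match: fall through, else skip 1
        _insn(BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | err),
        _insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),
    ])


class SockFprog(ctypes.Structure):
    _fields_ = [("len", ctypes.c_ushort), ("filter", ctypes.c_void_p)]


def install_filter(program):
    """Install the filter on the calling thread. This cannot be undone."""
    libc = ctypes.CDLL(None, use_errno=True)
    buf = ctypes.create_string_buffer(program, len(program))
    prog = SockFprog(len(program) // 8, ctypes.cast(buf, ctypes.c_void_p))
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero) != 0:
        raise OSError(ctypes.get_errno(), "PR_SET_NO_NEW_PRIVS failed")
    if libc.prctl(PR_SET_SECCOMP, ctypes.c_ulong(SECCOMP_MODE_FILTER),
                  ctypes.byref(prog)) != 0:
        raise OSError(ctypes.get_errno(), "PR_SET_SECCOMP failed")
```

Because the filter is irrevocable, it is best tried in a forked child, for example install_filter(build_deny_filter(101)) to deny ptrace on x86-64, where 101 is that architecture's ptrace number. Note that the decision is made on the syscall number alone, before any resource is looked up.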

Example : PTRACE_TRACEME Kernel Privilege escalation

This example is a bug in a system call (ptrace(2)). These happen occasionally. The kernel is under continuous development and is written in a language that does not have any built-in guarantees of safety.

nvd.nist.gov - CVE-2019-13272

In the Linux kernel before 5.1.17, ptrace_link in kernel/ptrace.c mishandles the recording of the credentials of a process that wants to create a ptrace relationship, which allows local users to obtain root access by leveraging certain scenarios with a parent-child process relationship, where a parent drops privileges and calls execve (potentially allowing control by an attacker).

With unrestricted access to the kernel system call interface, this vulnerability can be exploited by any local process to obtain root access, which is typically enough to go on to execute code with ring 0 privileges. For example,

redhat.com - Article 4292201

Unprivileged containers run in Kubernetes and OpenShift 4 clusters do not use seccomp filtering by default and can use the ptrace() syscall to exploit this vulnerability.

Figure: SECCOMP event tree

Preventive Control : Linux Security Modules

Linux Security Modules (LSM) is a framework in the Linux kernel that provides hooks from the kernel to kernel modules for the enforcement of one or more policies. Some of these modules are maintained as part of the Linux kernel source (SELinux, AppArmor, Smack), others are maintained separately. The hooks block execution for policy evaluation when sensitive functions are accessed.

“[..] the LSM framework is primarily focused on supporting access control modules, [..]. By itself, the framework does not provide any additional security; it merely provides the infrastructure to support security modules.”

Source - Kernel Documentation / Security / LSM

Notes,

  • Built into the kernel
  • Used by MAC extensions (e.g. AppArmor, SELinux)
  • With no explicit LSM built in, the default capabilities module is used
  • Most extensions extend the default capabilities
  • Active list is in /sys/kernel/security/lsm, in order of precedence
  • Any reduction in the number of available system calls reduces the attack surface
  • Filters function invocation in the context of a resource
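The active module list mentioned in the notes can be read like any other file. A small sketch, assuming securityfs is mounted at /sys/kernel/security (the helper name is hypothetical):

```python
def active_lsms(path="/sys/kernel/security/lsm"):
    """Return the active LSM names in the order listed, or [] if the file is unreadable."""
    try:
        with open(path) as f:
            text = f.read()
    except OSError:
        return []
    return [name for name in text.strip().split(",") if name]
```

The file holds a single comma-separated line, so the returned list preserves the ordering the kernel reports.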

Figure: LSM/MAC event tree

Preventive Control : Namespaces

Namespaces are a fundamental building block of containers on Linux. Linux has a global namespace that contains all resources. New namespaces can restrict the view a process has to a subset of this global namespace.

Figure: Resource visibility with namespaces

From /include/linux/nsproxy.h

/*
 * A structure to contain pointers to all per-process
 * namespaces - fs (mount), uts, network, sysvipc, etc.
 * [...]
 * The nsproxy is shared by tasks which share all namespaces.
 * As soon as a single namespace is cloned or unshared, the
 * nsproxy is copied.
 */
struct nsproxy {
	atomic_t count;
	struct uts_namespace *uts_ns;
	struct ipc_namespace *ipc_ns;
	struct mnt_namespace *mnt_ns;
	struct pid_namespace *pid_ns_for_children;
	struct net 	     *net_ns;
	struct cgroup_namespace *cgroup_ns;
};
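From user space, the namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns, each naming a namespace type and an identifier. The sketch below reads those links and compares two processes. The helper names are hypothetical, and reading another process's links may require additional privileges.

```python
import os


def namespace_ids(ns_dir="/proc/self/ns"):
    """Map each namespace entry (e.g. 'pid', 'net') to its link target, e.g. 'pid:[...]'."""
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


def shared_namespaces(a, b):
    """Return the namespace types for which two processes have the same identifier."""
    return sorted(name for name in a if a[name] == b.get(name))
```

Two processes in the same container would typically share every entry, while a containerised process compared against one on the host would usually differ in several.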

Figure: Namespace + ownership event tree

Combined preventive control model

Figure: Full preventive security control model on the Ring 3 to Ring 0 gate

Any security tool that enforces policy around the userspace/kernelspace (Ring3/Ring0) boundary will leverage one or more of these kernel functions.

Detective Control : Audit facility

Detective controls can be part of risk mitigation but are not discussed here in that context. Our use case here centers on verifying the policies put in place and testing for regressions.

Audit logging of seccomp actions

Since Linux 4.14, the kernel provides the facility to log the actions returned by seccomp filters in the audit log. The kernel makes the decision to log an action based on the action type, whether or not the action is present in the actions_logged file, and whether kernel auditing is enabled (e.g., via the kernel boot option audit=1).

Source - man 2 seccomp
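The actions_logged file referred to in the quote is exposed under /proc/sys/kernel/seccomp. As a small sketch (the helper names are hypothetical), the following reads it and checks whether a given action name is listed, which is one of the conditions the kernel considers before logging:

```python
def logged_seccomp_actions(path="/proc/sys/kernel/seccomp/actions_logged"):
    """Return the action names listed for logging, or [] if the file is unreadable."""
    try:
        with open(path) as f:
            return f.read().split()
    except OSError:
        return []


def is_listed_for_logging(action, actions):
    """True if `action` (e.g. 'errno') appears in the actions_logged list."""
    return action in actions
```

This only covers the list itself. Whether a record actually reaches the audit log also depends on the action type and on whether kernel auditing is enabled, as the quoted text describes.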

Figure: Full event tree with preventive and detective controls

From the full event tree we can see that it is possible to have visibility over all scenarios, except the kernel compromise through a vulnerable function. This is another reason that kernel compromise stands out as a high risk. In all other scenarios policy enforcement is either working, misconfigured or absent. None of these corrupt the integrity of the kernel, which means that the kernel audit facility will work as intended.

Further reading