Intel/Linux process isolation and containment
The Linux process model
We’ll take a traditional multi-user Linux environment as a starting point. In this scenario, the provider runs the hardware, the operating system and supporting system functions.
On Linux, a process cannot do much beyond computing on data in its own memory space. To do anything outside of that, the process has to request that the kernel perform an operation on its behalf. This includes operations such as,
- Starting a new process
- Listing the active processes on the system
- Reading data from a file
- Listening on a network port
- Opening a connection to a remote network port
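As a rough illustration of two of the operations above, the following Python sketch reads a file and lists processes through the thin wrappers in the standard `os` module, each of which ends up in a system call. The function names and the `proc_root` parameter are hypothetical helpers introduced here, and the sketch assumes a Linux system with `/proc` mounted.

```python
import os


def read_file_via_syscalls(path, size=4096):
    """Read up to `size` bytes using the low-level os wrappers."""
    fd = os.open(path, os.O_RDONLY)  # kernel opens the file on our behalf
    try:
        return os.read(fd, size)     # kernel copies data into our memory
    finally:
        os.close(fd)                 # kernel releases the file descriptor


def list_process_ids(proc_root="/proc"):
    """List PIDs; the kernel exposes one numeric directory per process."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())
```

Calling `list_process_ids()` with the default argument returns the processes visible to the caller, which is itself a view the kernel decides on.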
There is no direct communication between processes. As we are interested in the lines of communication, our depiction of the relationship between processes and kernel takes that into account.
The Linux kernel leverages constructs in the CPU to ensure that processes cannot perform these actions directly. The CPU that the system is running on is able to run instructions at different privilege levels. For Intel based processors these are generally referred to as rings. The Linux kernel itself runs in Ring 0, which is a privileged domain.
A note on terminology : Linux documentation will usually not refer to CPU privilege levels directly. Linux refers to User space which is Ring 3 (unprivileged) and Kernel space which is Ring 0 (privileged). Processes run in user space and the kernel runs in kernel space.
For example, some operations that require Ring 0 privileges are,
- Run all instructions
- Write to all registers
- Modify current segment register
- Modify page tables (CR3 register)
- Register interrupt handlers
- Use IO instructions
These instructions and capabilities give the kernel full control over the system.
If we try to execute any of these as a process (that is, while running in User space / Ring 3), the CPU will trap our code (halt execution) and hand over control to the kernel to decide what needs to be done. Usually the kernel will terminate the process, log the event and optionally dump the whole memory image of the offending process to disk.
This is one of the situations that produces the familiar message,
Segmentation fault (core dumped)
We still need our processes to open files and so on. We can ask the kernel to execute these instructions on our behalf, but there are a few steps to doing this.
The mechanism for calling the kernel is to trigger a privilege change - i.e. code executing in Ring 3 has to inform the CPU that it wants to hand over control to code in Ring 0. This is generally done by executing a special instruction, and the resulting request to the kernel is called a system call.
With this knowledge we can re-draw the relationship of processes.
Note on privilege changes
Interrupts can also trigger a privilege change. An interrupt is an event that is generated (or raised) by software or hardware. A hardware interrupt can signal, for example, the arrival of a packet on a network interface. A software interrupt is raised by executing the int instruction.
- Interrupt handler 0x80 (128) is the Linux legacy syscall handler. Calling an interrupt transfers control to a ring 0 interrupt handler. The kernel returns with the iret instruction
- 32-bit mode fast system calls use sysenter to call into ring 0 and sysexit to return to ring 3
- 64-bit mode fast system calls use syscall to call into ring 0 and sysret to return to ring 3
Collectively we will refer to these as the System Call Gate.
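To make the gate a little more concrete, here is a minimal Python sketch that passes through it by system call number instead of using a named wrapper. It relies on libc's generic `syscall(3)` function via `ctypes`. The table of `getpid(2)` numbers covers only the architectures listed and is an assumption of this sketch, as is the helper name `raw_getpid`.

```python
import ctypes
import os
import platform

# getpid(2) system call numbers; only these architectures are covered here.
SYS_GETPID = {"x86_64": 39, "i386": 20, "i686": 20, "aarch64": 172}


def raw_getpid():
    """Invoke getpid(2) by number through libc's syscall(3) wrapper."""
    machine = platform.machine()
    # A 32-bit interpreter on a 64-bit x86 kernel uses the 32-bit table.
    if machine == "x86_64" and ctypes.sizeof(ctypes.c_void_p) == 4:
        machine = "i686"
    number = SYS_GETPID.get(machine)
    if number is None:
        raise NotImplementedError("syscall number unknown for " + machine)
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(number))
```

On a supported architecture the value should match `os.getpid()`, since both paths end in the same kernel function.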
Policy Controls on the Linux System Call Gate
So far we know that,
- Processes cannot communicate with other processes directly
- Processes cannot access resources directly
- Performing either of these can only be done with privileged instructions
- We can only access privileged instructions indirectly via system calls
That makes the System Call Gate our primary attack surface.
We can gauge the size of that surface by running the command man 2 syscalls. It lists over 400 documented system calls. (We can always refer to the kernel source to find out how accurate this list is).
SYSCALLS(2) Linux Programmer's Manual SYSCALLS(2)
NAME
syscalls - Linux system calls
DESCRIPTION
The system call is the fundamental interface between an application and
the Linux kernel.
[...]
System call Kernel Notes
─────────────────────────────────────────────────────────────────────
accept(2) 2.0 See notes on socketcall(2)
accept4(2) 2.6.28
access(2) 1.0
acct(2) 1.0
add_key(2) 2.6.10
[...]
Preventive Control : Permissions, ACLs and Capabilities
This control is only applied once the system call function is already executing. It is also resource-based, which implies that the resource itself, or its metadata, has to be accessed in order for the control to be evaluated. In addition, Capabilities can be set on processes that allow this control to be bypassed entirely. The capability implementation itself also varies across capabilities and resources, increasing the likelihood of both bugs and evaluation errors.
The combination of those factors makes this a relatively weak control.
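To see which capabilities a process actually holds, we can decode the `CapEff` field of `/proc/<pid>/status`. The sketch below is a hedged illustration: the bit table is a small, non-exhaustive selection from `linux/capability.h`, and the function names are hypothetical.

```python
# A few capability bit positions from linux/capability.h (not exhaustive).
CAPABILITY_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    16: "CAP_SYS_MODULE",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}


def effective_capability_mask(status_text):
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff field found")


def decode_capabilities(mask):
    """Return the known capability names whose bits are set in `mask`."""
    return [name for bit, name in sorted(CAPABILITY_BITS.items())
            if mask & (1 << bit)]


def current_capabilities(pid="self"):
    """Decode the effective capabilities of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        return decode_capabilities(effective_capability_mask(status.read()))
```

A process that reports `CAP_DAC_OVERRIDE` here bypasses the file permission checks described above, which is the kind of bypass that weakens this control.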
For more information,
- man 7 capabilities
- redhat.com - Container security guide / Linux Capabilities and SECCOMP
- packagecloud.io - The definitive guide to Linux system calls
Preventive Control : Secure Computing Mode (SECCOMP)
Notes,
- SECCOMP has two modes - 1 (strict) or 2 (filter)
  - Strict mode only allows a process to call read(2), write(2), _exit(2) and sigreturn(2)
  - Filter mode allows system calls based on a pointer to a BPF program
- A firewall for syscalls
- Any reduction in the number of available system calls reduces the attack surface
- Filter function invocation without resource context
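The mode numbers in the notes above are also what the kernel reports in the `Seccomp:` field of `/proc/<pid>/status`, where 0 means seccomp is disabled. The following sketch reads that field; the function names are hypothetical, and returning "unknown" when the field is missing is a choice made for this example.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Map the Seccomp field of /proc/<pid>/status to a mode name."""
    for line in status_text.splitlines():
        # The trailing colon avoids matching the Seccomp_filters field.
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES.get(int(line.split()[1]), "unknown")
    return "unknown"  # field absent, e.g. kernel built without seccomp


def current_seccomp_mode(pid="self"):
    """Report the seccomp mode of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        return seccomp_mode(status.read())
```

Running `current_seccomp_mode()` inside a container is a quick way to check whether the runtime applied a filter at all.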
Example : PTRACE_TRACEME Kernel Privilege escalation
This example is a bug in a system call (ptrace(2)). These happen occasionally. The kernel is under continuous development and is written in a language that does not have any built-in guarantees of safety.
In the Linux kernel before 5.1.17, ptrace_link in kernel/ptrace.c mishandles the recording of the credentials of a process that wants to create a ptrace relationship, which allows local users to obtain root access by leveraging certain scenarios with a parent-child process relationship, where a parent drops privileges and calls execve (potentially allowing control by an attacker).
With unrestricted access to the kernel system call interface, this vulnerability can be exploited by any process on the system to execute code with ring 0 privileges. For example,
Unprivileged containers run in Kubernetes and OpenShift 4 clusters do not use seccomp filtering by default and can use the ptrace() syscall to exploit this vulnerability.
Preventive Control : Linux Security Modules
Linux Security Modules (LSM) is a framework in the Linux kernel that provides hooks from the kernel to kernel modules for the enforcement of one or more policies. Some of these modules are maintained as part of the Linux kernel source (SELinux, AppArmor, Smack), others are maintained separately. The hooks block execution for policy evaluation when sensitive functions are accessed.
“[..] the LSM framework is primarily focused on supporting access control modules, [..]. By itself, the framework does not provide any additional security; it merely provides the infrastructure to support security modules.”
Notes,
- Built into the kernel
- Used by MAC extensions (e.g. AppArmor, SELinux)
- With no explicit LSM built-in, default capabilities are used
- Most extensions extend the default capabilities
- Active list is in /sys/kernel/security/lsm, in order of precedence
- Any reduction in the number of available system calls reduces the attack surface
- Filter function invocation in context of a resource
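The active list mentioned in the notes is a single comma-separated line, so it is easy to inspect from a script. The sketch below is one possible way to read it. The function names are hypothetical, and returning `None` when the file cannot be read (for example when securityfs is not mounted) is an assumption of this example.

```python
def parse_lsm_list(text):
    """Split the comma-separated module list, keeping the kernel's order."""
    return [name for name in text.strip().split(",") if name]


def active_lsms(path="/sys/kernel/security/lsm"):
    """Return the active LSMs in order, or None if the list is unavailable."""
    try:
        with open(path) as handle:
            return parse_lsm_list(handle.read())
    except OSError:
        return None  # securityfs not mounted, or not readable by this user
```

A result such as `["capability", "yama", "apparmor"]` would show the default capabilities module first, followed by the modules that extend it.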
For more information visit,
- wikipedia.org - Linux Security Modules
- kernel.org - Security / LSM
- kernel.org - Admin / LSM
- kubernetes.io - AppArmor
Preventive Control : Namespaces
Namespaces are a fundamental building block of containers on Linux. Linux has a global namespace that contains all resources. New namespaces can restrict the view a process has to a subset of this global namespace.
From /include/linux/nsproxy.h
/*
 * A structure to contain pointers to all per-process
 * namespaces - fs (mount), uts, network, sysvipc, etc.
 * [...]
 * The nsproxy is shared by tasks which share all namespaces.
 * As soon as a single namespace is cloned or unshared, the
 * nsproxy is copied.
 */
struct nsproxy {
    atomic_t count;
    struct uts_namespace *uts_ns;
    struct ipc_namespace *ipc_ns;
    struct mnt_namespace *mnt_ns;
    struct pid_namespace *pid_ns_for_children;
    struct net *net_ns;
    struct cgroup_namespace *cgroup_ns;
};
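From user space, the namespaces a process belongs to are visible as symbolic links under `/proc/<pid>/ns`, with targets of the form `uts:[4026531838]`. The sketch below collects them for one process. The helper names are hypothetical, and comparing only the number in brackets is a simplification made for this example.

```python
import os
import re

# Link targets look like "uts:[4026531838]".
NS_LINK = re.compile(r"^(?P<kind>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError("unexpected namespace link: " + repr(target))
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result
```

Two processes that report the same number for an entry share that namespace. Comparing `namespaces_of()` inside and outside a container shows which parts of the global namespace the container has been given its own view of.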
For more information visit,
- wikipedia.org - Linux_namespaces
- LWN - 2001 - Per-process namespaces
- LWN - 2007 - PID namespaces in the 2.6.24 kernel
- LWN - 2012 - User namespaces progress
- LWN - 2013 - (7-part series) Namespaces in operation, part 1: namespaces overview
Combined preventive control model
Any security tool that enforces policy around the userspace/kernelspace (Ring3/Ring0) boundary will leverage one or more of these kernel functions.
Detective Control : Audit facility
Detective controls can be part of risk mitigation but are not discussed here in that context. Our use case here centers on verifying the policies put in place and testing for regressions.
Audit logging of seccomp actions
Since Linux 4.14, the kernel provides the facility to log the actions returned by seccomp filters in the audit log. The kernel makes the decision to log an action based on the action type, whether or not the action is present in the actions_logged file, and whether kernel auditing is enabled (e.g., via the kernel boot option audit=1).
Source - man 2 seccomp
- selectel.com - Auditing linux system events
- archlinux.org - Audit framework
- redhat.com - Security / CHAP system auditing
- redhat.com - Security / Audit
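When verifying a policy, it helps to know which seccomp actions the kernel will not log. The `actions_logged` file from the man page excerpt above lives under `/proc/sys/kernel/seccomp/`, next to `actions_avail`, and both hold space-separated action names. The sketch below compares the two; the function names are hypothetical and it assumes a kernel of 4.14 or later.

```python
import os

SECCOMP_SYSCTL_DIR = "/proc/sys/kernel/seccomp"


def parse_actions(text):
    """Split a space-separated list of seccomp action names."""
    return text.split()


def unlogged_actions(avail_text, logged_text):
    """Return available actions that are absent from the logged list."""
    logged = set(parse_actions(logged_text))
    return [action for action in parse_actions(avail_text)
            if action not in logged]


def read_seccomp_sysctl(name):
    """Read actions_avail or actions_logged (Linux 4.14 or later)."""
    with open(os.path.join(SECCOMP_SYSCTL_DIR, name)) as handle:
        return handle.read()
```

A regression test could then assert that `unlogged_actions(read_seccomp_sysctl("actions_avail"), read_seccomp_sysctl("actions_logged"))` contains nothing the policy depends on seeing in the audit log.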
From the full event tree we can see that it is possible to have visibility over all scenarios, except the kernel compromise through a vulnerable function. This is another reason that kernel compromise stands out as a high risk. In all other scenarios policy enforcement is either working, misconfigured or absent. None of these corrupt the integrity of the kernel, which means that the kernel audit facility will work as intended.