One lingering worry with containerization is security. Previously, with conventional type 0 and type 1 (native, bare-metal) hypervisor technology, we greatly limited our trusted computing base to small hypervisors (e.g., Xen is < 150 kloc). Some were so small (the seL4 core was about 7.5 kloc) that they were amenable to mechanized formal verification. OSes supporting containers, in contrast, are much larger. Even CoreOS, intended as a slimmed-down version of the Chrome OS Linux kernel that supports only modern bare-metal architectures for containers, is fundamentally more challenging to vet, let alone verify, than a simple hypervisor. Etcd and fleet alone add up to 44k sloc of Go. So for all the great inroads we were making in verification, the move toward containerization in the data center brings new challenges and potentially resets some of the progress the community has made in mechanically verifying the security and functional correctness of the lowest layers of software systems and infrastructure.
Simply put, the security of containers (i.e., operating-system-level virtualization) relies on isolation imposed by the Linux kernel through cgroups and the associated kernel namespaces (including PID, mount/file system, IPC, network, and user namespaces). Cgroups was originally intended for limiting a single process's memory and I/O bandwidth and for CPU prioritization (think UNIX nice). The main mechanism cgroups exposes for creating and maintaining groups is a virtual file system accessible from user space. Therein lies the challenge: cgroups is ultimately a kernel service, and its file-system interface potentially exposes kernel internals to the mercy of user space, including container processes, which can issue arbitrary syscalls to the kernel. The cgroups accounting architecture itself is hierarchical by design, so children of container processes ought not to be able to exceed the resource allocations granted to their parents.
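To make the file-system interface concrete, here is a minimal sketch in Go of how user space drives cgroups through ordinary file operations. It assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory; the cgroup name "demo" and the 64 MiB limit are made up for illustration, and real container runtimes do considerably more bookkeeping than this.

```go
// Sketch: manipulating a cgroup purely via its virtual file system.
// Assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory
// and sufficient privileges; the "demo" group and 64 MiB cap are illustrative.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Creating a directory under the controller's mount point creates a
	// child cgroup; the kernel populates it with control files.
	cg := "/sys/fs/cgroup/memory/demo"
	if err := os.MkdirAll(cg, 0o755); err != nil {
		panic(err)
	}

	// Writing to a control file sets a resource limit for the group
	// (here, a 64 MiB memory cap).
	if err := os.WriteFile(filepath.Join(cg, "memory.limit_in_bytes"),
		[]byte("67108864"), 0o644); err != nil {
		panic(err)
	}

	// Writing a PID into the tasks file moves that process into the
	// cgroup; all of this is plain file I/O issued from user space.
	pid := []byte(fmt.Sprintf("%d", os.Getpid()))
	if err := os.WriteFile(filepath.Join(cg, "tasks"), pid, 0o644); err != nil {
		panic(err)
	}

	fmt.Println("current process placed in cgroup", cg)
}
```

The point of the sketch is simply that everything a container runtime does here is mediated by a file-system view of kernel state, which is exactly the user-space exposure discussed above.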