In my past few posts, I have focused on fault-tolerant distributed systems as implemented through cluster managers. Apache Mesos, Kubernetes, and many others all attempt to support fault tolerance through auto-restart and other self-healing techniques at the cluster manager level. As such, they rightly claim to be the new operating systems of the cloud. It turns out, however, that cluster managers do not have a monopoly on fault tolerance features. Long before Mesos, Kubernetes, and possibly even the University of Wisconsin's Condor, a distributed processing system with considerably more pedigree, Erlang had supervisor trees and supervisor behaviors (a kind of language interface) built into its runtime, and so supported large, highly fault-tolerant distributed systems decades ago.
A supervisor tree consists of supervisor and worker processes, where supervisors may themselves have supervisors (i.e., a supervisor can oversee both subordinate supervisors and workers). Erlang supervisors support three process restart strategies:
- one-for-one: when a child process fails or quits, only that process is restarted
- one-for-all: when a child process terminates, all its sibling processes are terminated, and then all of them, including the original child, are restarted
- rest-for-one: when a child process terminates, all younger siblings (i.e., sibling processes that started after) are terminated and the original child process and its younger siblings are restarted
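Each child process also has a restart type that determines whether it is restarted at all:
- permanent: always restarted, as for always-on services
- temporary: never restarted under any circumstance
- transient: restarted only after abnormal termination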
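To make the rules above concrete, here is a minimal Python sketch that models them. It is a simplified illustration, not how the Erlang runtime implements supervisors: the names `Strategy`, `Restart`, `should_restart`, and `affected_children` are hypothetical, and details such as restart intensity limits are left out.

```python
from enum import Enum


class Strategy(Enum):
    """Supervisor restart strategies (hypothetical model, not an Erlang API)."""
    ONE_FOR_ONE = "one-for-one"
    ONE_FOR_ALL = "one-for-all"
    REST_FOR_ONE = "rest-for-one"


class Restart(Enum):
    """Per-child restart types (hypothetical model, not an Erlang API)."""
    PERMANENT = "permanent"
    TEMPORARY = "temporary"
    TRANSIENT = "transient"


def should_restart(restart, abnormal):
    """Decide whether a terminated child is restarted at all."""
    if restart is Restart.PERMANENT:
        return True
    if restart is Restart.TEMPORARY:
        return False
    # Transient children are restarted only after abnormal termination.
    return abnormal


def affected_children(children, failed, strategy):
    """Return the children that get restarted, given `children` in start order."""
    index = children.index(failed)
    if strategy is Strategy.ONE_FOR_ONE:
        return [failed]
    if strategy is Strategy.ONE_FOR_ALL:
        return list(children)
    # rest-for-one: the failed child plus every sibling started after it.
    return children[index:]


if __name__ == "__main__":
    workers = ["db", "cache", "web"]  # assumed start order, oldest first
    for strategy in Strategy:
        print(strategy.value, affected_children(workers, "cache", strategy))
```

In this model, a supervisor would first call `should_restart` with the failed child's restart type, and only if that returns true would it use `affected_children` to work out which siblings are terminated and restarted along with it.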