Wednesday, December 17, 2014

Retry Pattern

A common design pattern in fault-tolerant distributed systems is the retry pattern. A given operation may experience a variety of failures:

  1. Rare transient failures (e.g., a corrupted packet) can be recovered from immediately, so the operation should retry immediately.
  2. Common transient failures (e.g., a busy network) should be retried after waiting for a period of time, possibly with exponential backoff.
  3. Permanent failures should not be retried; the operation should bail out and clean up.

Of course, the final case is that the operation succeeds, and the function must then do whatever work follows from success. This design pattern is interesting not only for distributed systems but for fault-tolerant systems in general. For example, high-performance JavaScript engines have parsers and tokenizers that must be robust to various failures. In fact, this is one case where large systems have used multiple exit points and more complicated control flow, for which C-based programs may use gotos.
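
The three failure classes listed earlier map naturally onto a small policy function. What follows is only a minimal sketch: the names Failure, shouldRetry, and retryDelay, as well as the 100 ms base delay and 5 s cap, are assumptions made for illustration and do not come from any particular system.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical failure classes mirroring the list of failure kinds.
enum class Failure { RareTransient, CommonTransient, Permanent };

// Permanent failures are the only ones that must not be retried.
bool shouldRetry(Failure f) { return f != Failure::Permanent; }

// How long to wait before the next attempt (attempt is 0-based).
// base and cap are illustrative defaults, not recommended values.
std::chrono::milliseconds retryDelay(
        Failure f, int attempt,
        std::chrono::milliseconds base = std::chrono::milliseconds(100),
        std::chrono::milliseconds cap = std::chrono::milliseconds(5000)) {
    if (f != Failure::CommonTransient) {
        // Rare transient: retry at once. Permanent: the caller should not
        // retry at all, so the delay is irrelevant.
        return std::chrono::milliseconds(0);
    }
    // Exponential backoff: base * 2^attempt, with the exponent clamped so
    // the shift cannot overflow, and the result capped at `cap`.
    int shift = std::max(0, std::min(attempt, 20));
    std::chrono::milliseconds delay(base.count() << shift);
    return std::min(delay, cap);
}
```

A caller would check shouldRetry() first and then sleep for retryDelay() before the next attempt. Capping the delay keeps a long run of failures from producing unbounded waits.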

In the Mozilla SpiderMonkey tokenizer, the retry pattern takes roughly the following shape:

bool getToken() {
   // setup
retry:
   // main body
   // ... goto out
   // ... goto error
   // ... goto retry
out:
   // normal exit
   return true;   // without this, control would fall through into error
error:
   // permanent failure
   return false;
}

Because the tokenizer is concerned with tokenizing errors rather than network socket errors, it has no notion of common versus rare transient failures: any error it encounters is either permanent or rare transient. This implementation of the pattern uses gotos because it does not assume that exceptions are available. A similar implementation using standard exceptions would look like this:

bool getToken() {
   // setup
   for (int retryCount = 0; retryCount < MAX_RETRIES; retryCount++) {
      try {
         // main body
         // normal exit
         return true;
      } catch (const std::exception& e) {
         if (isTransient(e)) {
            continue; // transient failure: try again
         }
         // permanent failure cleanup
         break; // or throw;
      }
   }
   return false; // permanent failure or retries exhausted
}

The advantage of this version is that it preserves lexical scoping. Variables declared in the main body, the normal-exit path, and the error path are each confined to their own block, so one section cannot accidentally use values that belong to another.

There is yet another way to manage the retry logic that is arguably more reusable and readable, especially in the case where multiple functions require the same retry logic.

boost::variant<success, rareTransient, commonTransient, permanent>
getToken() {
    // main body
    // ... return rareTransient();
    // ... return commonTransient();
    // ... return permanent();
    // normal exit
    // return success();
}

bool handler() {
   for (int retryCount = 0; retryCount < MAX_RETRIES; retryCount++) {
      auto res = getToken();
      // boost::get<T>(&res) returns a T* that is null unless res holds a T
      if (boost::get<success>(&res)) {
         // normal exit
         return true;
      } else if (boost::get<rareTransient>(&res)) {
         continue; // retry immediately
      } else if (boost::get<commonTransient>(&res)) {
         sleep(timeout); // wait before retrying
         continue;
      } else if (boost::get<permanent>(&res)) {
         // cleanup
         break;
      }
   }
   return false; // permanent failure or retries exhausted
}

Note that handler() can be parameterized with the function to retry, so that the call to getToken() is replaced with a call to that parameter.
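
One way such a parameterized handler might look is sketched below. To keep the example self-contained it swaps in C++17 std::variant for boost::variant, and the function name retry, its parameters, and the empty tag structs are assumptions made for illustration rather than part of any existing code base.

```cpp
#include <chrono>
#include <thread>
#include <variant>

// Hypothetical result tags, reusing the names from the handler in the text.
struct success {};
struct rareTransient {};
struct commonTransient {};
struct permanent {};
using Result = std::variant<success, rareTransient, commonTransient, permanent>;

// Calls op up to maxRetries times. Returns true only if op reports success;
// returns false on a permanent failure or when the retries are exhausted.
template <typename Op>
bool retry(Op op, int maxRetries, std::chrono::milliseconds timeout) {
    for (int attempt = 0; attempt < maxRetries; ++attempt) {
        Result res = op();
        if (std::holds_alternative<success>(res)) {
            return true;                            // normal exit
        }
        if (std::holds_alternative<permanent>(res)) {
            return false;                           // do not retry
        }
        if (std::holds_alternative<commonTransient>(res)) {
            std::this_thread::sleep_for(timeout);   // wait, then retry
        }
        // rareTransient: fall through and retry immediately
    }
    return false;                                   // retries exhausted
}
```

Any callable that returns a Result can be passed in, for example retry(getToken, MAX_RETRIES, timeout) if getToken were declared to return Result, so several functions can share the same retry policy.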
