Errors in Database Systems Still Must Consider Network Partitions

Prof. Michael Stonebraker wrote a paper, published in the April 2010 issue of CACM, entitled “Errors in Database Systems: Eventual Consistency and the CAP Theorem”. As I see it, the overall point of the paper is that the kinds of failures that cause partition-tolerance problems are rare, and not very significant compared to the other ways that a DBMS can fail; therefore, the argument goes, people are worrying too much about the CAP issue, and specifically about network partitions.

The paper enumerates eight causes of DBMS failure as seen by an application, such as software errors in the DBMS itself, operating system errors, and so on. Number six in his list is “a network partition in a local cluster”, and number eight is “a network failure in the WAN connecting clusters together; the WAN failed and clusters can no longer all communicate with each other”. (The usual reason one would have multiple clusters connected by a WAN is “disaster recovery”, i.e. dealing with a problem that causes an entire cluster to fail, such as a power failure that takes down all of the hardware on which the cluster depends.)

Number six is the crucial one as far as CAP goes. He has two answers:

  1. In my experience, this is exceedingly rare, especially if one replicates the LAN (as Tandem did).
  2. The overwhelming majority [of local failures] cause a single node to fail, which is a degenerate case of a network partition that is easily survived by lots of algorithms.

About #1, I have spent some time talking to the operations architects at ITA Software, which has run high-availability servers for many years now, and heard about their experience. It depends on what you mean by a “LAN”. If you mean a few computers connected together by an Ethernet, with redundant hardware all around, then the chance of a failure of the network itself is relatively low. However, real-world data centers with a relatively large number of servers rarely work this way. The problem is that a real network is very complicated. It depends on switching at both level 3 (routers) and level 2 (switches). Situations can arise in which pieces of the network are misconfigured by accident, and these can be hard to track down due, ironically, to the very redundancy that was added to avoid failures. In particular, there is no way to guarantee the latency within the network, nor how likely it is that a packet will make it from its source to its destination.

About #2, it would have been helpful if the paper had provided citations; it is hard to reply to such a claim without specifics. One family of techniques for dealing with a single failed server involves “quorums”, but quorums can introduce availability problems of their own, as sketched below.
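To make the quorum idea concrete, here is a minimal sketch of read/write quorums over N replicas; the class and names are hypothetical, not taken from Stonebraker’s paper or any particular system. A write must reach W replicas and a read must contact R, with R + W > N so that every read quorum overlaps every write quorum. The flip side, shown at the end, is that when too few replicas are reachable the operation has to be refused, which is the availability cost I mean.

```python
# Minimal quorum read/write sketch (illustrative only; all names are hypothetical).
# With N replicas, a write must reach W of them and a read must contact R,
# where R + W > N, so every read quorum overlaps every write quorum.

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        assert r + w > n, "quorum condition violated"
        self.n, self.w, self.r = n, w, r
        self.replicas = [{} for _ in range(n)]   # each maps key -> (version, value)
        self.up = [True] * n                     # simulated node availability

    def write(self, key, value, version):
        acks = 0
        for i in range(self.n):
            if self.up[i]:
                self.replicas[i][key] = (version, value)
                acks += 1
        if acks < self.w:
            # Not enough replicas reachable: the write must be refused,
            # which is exactly the availability cost mentioned above.
            raise RuntimeError("write quorum not met")

    def read(self, key):
        answers = [self.replicas[i].get(key) for i in range(self.n) if self.up[i]]
        if len(answers) < self.r:
            raise RuntimeError("read quorum not met")
        # Return the entry with the highest version seen in the read quorum.
        return max((a for a in answers if a is not None), default=None)

store = QuorumStore()
store.write("flight", "AA100", version=1)
store.up[0] = False                # one node down: quorums still work
store.write("flight", "AA200", version=2)
print(store.read("flight"))        # -> (2, 'AA200')
store.up[1] = False                # two nodes down: availability is lost
# store.write("flight", "AA300", version=3)  # would raise RuntimeError
```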

But, more importantly, consider some of the failure modes that Amazon’s “Dynamo” highly-available key-value store is built to deal with. Suppose we have a key-value pair that resides on two servers, for high availability. Call them A1 and A2. An application changes the value of the pair, but does so at a time when A1 is down or unreachable, so the update is made only to replica A2. Later, a second application reads the value associated with the key, but this time server A2 is down or unreachable, and server A1 is available. The second application will not see the new value written by the first application. I don’t know of any of the “lots of algorithms” that deals with this sort of scenario while providing complete consistency/correctness.
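Here is a small sketch of that scenario; the replica names A1 and A2 follow the text above, but the code itself is purely illustrative and is not how Dynamo is actually implemented. The writer can only reach A2, the later reader can only reach A1, and so the reader gets the stale value.

```python
# Sketch of the stale-read scenario described above (hypothetical code,
# not Dynamo itself). Two replicas hold the same key; each operation can
# only reach the replicas that happen to be up at that moment.

A1 = {"seat": "old"}
A2 = {"seat": "old"}

def write(key, value, reachable):
    # Best-effort write: update whichever replicas are reachable.
    for replica in reachable:
        replica[key] = value

def read(key, reachable):
    # Read from the first reachable replica.
    return reachable[0][key]

# The first application writes while A1 is down, so only A2 gets the update.
write("seat", "new", reachable=[A2])

# Later, a second application reads while A2 is down and A1 is back up.
print(read("seat", reachable=[A1]))   # -> 'old': the update is invisible
```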

The conclusion of the paper is: “In summary, one should not throw out the C so quickly, since there are real error scenarios where CAP does not apply and it seems like a bad tradeoff in many of the other situations.” But since network-partition failures really can happen, it’s not clear that one can simply decide to keep the consistency/correctness criterion and avoid the tradeoff: when a partition does occur, the choice between consistency and availability still has to be made.