Archive for the ‘High Availabilty’ Category

Improving the PACELC Taxonomy

Wednesday, January 12th, 2011
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Daniel Abadi of Yale published a blog article last April criticizing the characterization of distributed storage systems using “you only get two of C, A, and P”. He proposed a new taxonomy with the acronym PACELC, to be read “If P, then trade off A for C; if E, trade off L for C”, with this meaning:

  • When there is a partition, how does the system trade off between:
    • Availability and
    • Consistency
  • Else, when there are no partitions, how does the system trade off between:
    • Latency and
    • Consistency

For example, a system characterized as PA/EL means that in the face of partitions, it favors availability over consistency, and if everything’s working, it favors low latency over consistency.

I think this is moving very much in the right direction, and I hope I can contribute and help develop these ideas a bit.

Problems with the “Proof of the CAP theorem”

The “CAP” characterization has a lot of problems. It is especially poorly applied, if not actually misused, when someone trots out the “proof of the CAP theorem” to show how they were forced into a tradeoff. While the proof is correct, what is proves is too crude to model what we really care about.

I discussed the proof in an earlier post. In the proof, each attribute is problematic:

  1. “Consistency” means the behavior that would have happened on a single server that never crashed and did each operation in serial, which is fine, but lack of consistency means that the system makes no guarantees or representations about the result of an operation whatsoever.
  2. “Available” means that you get an answer “eventually”, but since eventually can mean any amount of time (a trillion years), there’s no practical difference between A and not A.
  3. “Partition-tolerance” is never actually defined in the paper.

“E” implies “A”

The reason the acronym doesn’t need to be “PACELCA” is that if there are no partitions, then the system must be available. Adding an “A” to the second part is redundant. But for me (maybe not for you), putting in the redundant “A” in the “E” case helps me. A PA/EL system is always “available”, and calling it PA/ELA makes it easier for me to see that availability is always there.

How do Availability and Latency relate?

Consider what “highly available” and “low latency” mean. They are not entirely distinct and orthogonal. The only useful meaning of “A” is that the system replies within a maximum latency. It could be something like “response within 10ms at least 90% of the time and within 100ms in any case” rather than a simple deadline. We can call this “fast enough” to meet the system requirements. So availability is about latency.

There is, however, an important practical difference. “Available” refers to a system’s latency related to the amount of time it takes to repair a partition.

To see this, consider two web sites (with human users) that are based on a system that can have partitions:

  • The operators of the system move so quickly that they always fix partitions within 10ms. The system is “available” even in the face of any single partition, without any special mechanism to be “partition tolerant”.
  • The operators of the system move so slowly that it takes them five minutes to fix a partition. If the system has no way to be “partition tolerant”, it’s not available.

Latency (the “L” in PACELC) has nothing to do with repair time, since it only applies when there are no partitions. A web site is far better with a (maximum/average/whatever) latency of 10ms than with 1000ms.

So “A” and “L” are different. But, that said, even if a system meets its “A” (fast enough) requirement, it can be valuable to lower the latency below that requirement. The “PAC” characterization does not take this into account.

PC/EL is confusing

If a system is consistent when there are partitions, then surely it’s also consistent when there aren’t any partitions. If the components work better, the service should not be worse.

At first glance, this seems to mean that “if PC, then EC”. That would mean that PC/EL can’t describe any realistic system, but Prof. Abadi characterizes PNUTS/Sherpa (as originally presented). I’m sure that there isn’t really a paradoxical situation with any real system, but rather that there is a way to misinterpret the PACELC notation. What do PC and EL really mean?

PC means that if a client sends a request when there are partitions that prevent the system from answering promptly and correctly, then the system does not answer, rather than providing an answer that might be incorrect. Indeed, it might not be able to reply at all, since a total failure is a kind of partition, and there just isn’t anybody to send back a reply.

EL means that if a client sends a request, and the system can choose between waiting a longer time to send a consistent answer, versus waiting a shorter time to send an inconsistent answer, it chooses (or tilts toward) the latter.

Loose ends

  • What does “C” really mean? Can’t we say something better than “we don’t guarantee consistency”? Dynamo can give you answers that are not definitive but are very useful, with semantics that the application can understand. What about “eventual consistency”?
  • What about durability? There’s a big difference between some data being temporarily offline versus data being lost forever. Some systems use “commits over a WAN” to replace the use of disks, and then the tradeoff of latency versus correctness, from synchronous to asynchronous commits, is important.
  • Should be distinguish between “available for read” vs. “available for write”? This can come up in, e.g., a master-slaves configuration.
  • Stay tuned.

    NoSQL Storage Systems Never Violate ACID. Never? Well, Hardly Ever!

    Tuesday, September 7th, 2010
    news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

    Everybody agrees that the new “NoSQL” storage systems “aren’t ACID”, or “don’t have transactions”.  This is true <i>in a sense</i>, but without knowing the sense, it doesn’t tell you much.

    In one sense, they <i>do</i> have transactions that are limited to having one operation per transaction.  One operation could mean reading, writing, incrementing, or doubling the value associated with a particular key.  For example, look at an “insert” operation in a key/value store.  An operations acts on only one data object.  Are these single-operation transactions ACID?  Let’s check each criterion:

    A means “atomic”: either all the operations happen, or none of them happens.  Well, there’s only one operation.  The key-value store <i>does</i> guarantee that either the insert happens, or it doesn’t.  So the transaction atomic.

    C means “consistent”.  In relational database systems, people use this to mean that various interesting consistency guarantees are maintained.  But here, we don’t have to worry about such things as referential integrity, since there are no references to have integrity; that is, there are no foreign keys.  So it’s consistent.

    I means “isolated”: concurrency is never seen by the application.  The system behaves as if each operation happened at a particular, distinct moment in time.  The key-value stores all make this guarantee.

    D means “durable”: before the application is told that the transaction has been completed successfully (i.e. committed), any side-effects it does are in stable storage so that if a node stops (such as a crash of a process or a whole node) won’t lose the results of the side-effects.  Here, a transaction is only one operation, but that doesn’t change anything: the system does provide “durability”.  (Some systems might cheat by not actually forcing data to stable storage, but we’re not talking about those.)

    So it appears to be ACID!  OK, something has <i>got</i> to be wrong here, right?

    Right.  Where I tried to pull the wool over your eyes is the definition of “C”.  “C” doesn’t just mean conforming to the databases integrity constraints.  It means that the system returns the correct answer! That is, response to any operation is consistent with some state that the database could be in.  There’s more than one such state when there are concurrent operations going on, which might be ordered in more than one way, depending on how the concurrency system works.  So it’s clearer to think of “C” as meaning “correct”.  (In the famous Gilbert and Lynch paper that “proves the CAP theorem”, that’s what they mean by “C”.)

    The “NoSQL” storage systems are guaranteed return the correct answer <i>only</i>if there are no partitions in the network.  But if there are (or were, e.g. at write time) partitions, they can return things like “two replicas say the value is X, but another replica says that the answer is Y”, and the application has to try to make sense of and cope with that.  That is <i>not</i> “C”.  This is usually called “eventually consistency”: if the partitions were to eventually heal and the system deferred accepting new operations until all the in-progress operations finished, and something went over the whole database to fix up any inconsistencies that happened during writes, then the system would become fully consistent, and would be behave correctly until the next partition.

    that there are at least two nodes that cannot send messages between each other.  It’s important to know that if a node in your your system is down, that’s considered a partition: it’s as if this node were disconnected from the network.

    The “NoSQL” systems are ACID, as long as you accept that a transaction can only perform one operation, in the sense that the only thing that gets in the way of being ACID is when there are network partitions and the system is called upon to perform operations while the partition is still there.

    “Partition” is a somewhat slippery concept that I will examine in an upcoming separate essay.  But the basic ides is that a it means that there are at least two nodes that cannot send messages between them.  It’s important to know that if a node in your your system is down, that’s considered a partition: it’s as if this node were disconnected from the network.

    This also shows that the name “NoSQL” doesn’t explain everything that’s important about these systems.  But you can’t pack a whole lot into a short, punchy name, so I’m not really complaining.  ( do the same thing with the names of my blog essays; <i>mea culpa<i>.  You just have to keep in mind that the lack of SQL is not the only important thing.

    Air New Zealand’s outage last fall

    Friday, July 30th, 2010
    news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

    I recently was reminded that  Air New Zealand’s airline reservation system went out of service on Sunday morning at about 9:30 am, October 11, 2009.  This story is very interesting to me, since my team is building just such a system (with a very different underlying implementation).

    IBM said that the outage happened because of a power failure at an IBM data center in Newton that took out their mainframe.  Many existing airline reservation systems run on a single IBM mainframe; mainframes are known for being rock-solid reliable, but not without electricity! IBM said it was caused by a failed oil pressure sensor on a backup generator.  What’s more, the problem happened during a scheduled maintenance session!

    The outage affected more than 10,000 passengers, leaving airports “in disarray”.  Most systems were restored around 1.30 pm [four hours later], but the passenger backlog did not start to clear until self check-in kiosks were up and running again about 3.30pm [six hours later]. Air New Zealand was, to put it mildly, furious.

    As usual, people never (well, hardly ever) adequately test their redundant backup technology! In particular, they should have also used the generators for a long enough time to test for this kind of failure.  I recently heard about another such pr0blem, in a discussion at ITA, where the backup generators worked but didn’t work for long enough.  (At least I think it was a distinct case, since I heard it a while ago, but I could be wrong.)

    You must do these tests reasonably frequently, since things can break over time, even if they are merely lying in wait. I plan to write more about this in a future blog post.

    I don’t know, but I’ve been told: most, if not all, airlines do not actually have disaster recovery setup that would switch over to a geographically distant site.  Evidently airlines are surprisingly “penny wise and pound foolish” when it comes to redundant components, which they are loath to pay for.  (I think we were talking about network connections but it’s too long ago for me to remember clearly.  The same principle applies across the board.)

    Afterward, apparently IBM’s main job was to grovel.  Air New Zealand, in the person of CEO Rob Fyfe, said in strong language that IBM took a long time to react, accept responsibility, and apologize.  He called IBM “amateur”, which is quite an insult for IBM, and that his IT team was looking for alternative suppliers already.  (I don’t know how that turned out.)

    IBM did apologize by the evening of the next day, and said they “immediately engaged a team of 32 local IT professionals, supported by global colleagues. This means Mr. Fyfe considers two working days to be a very long time for such an apology.  Perhaps he was manly putting on a public show of anger, actually intended more for his customers and shareholders than for IBM.  But I don’t think that actually matters, from IBM’s point of view.  As someone participating in a team building such a system, that’s the point of view that I am most concerned with.

    (By the way, back at MIT in the late 1970′s, when a guy from Digital Equipment showed up for “preventive maintenance” on one of our timesharing systems (removable disks on the MIT-MC KL/10), we called it “causitive maintenance”.  He once made a mistake that caused a lot of trouble.)

    I don’t just mean this to rag on IBM.  Making systems that are working fully 24/7 is quite difficult and expensive.  Our team can perhaps learn a little bit from this: if nothing else, it is one more data point about the cost/benefit of system failure for an airline reservation system.  When airlines talk about high availability, they are not kidding!