Archive for the ‘Database’ Category

Improving the PACELC Taxonomy

Wednesday, January 12th, 2011
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Daniel Abadi of Yale published a blog article last April criticizing the characterization of distributed storage systems using “you only get two of C, A, and P”. He proposed a new taxonomy with the acronym PACELC, to be read “If P, then trade off A for C; if E, trade off L for C”, with this meaning:

  • When there is a partition, how does the system trade off between:
    • Availability and
    • Consistency
  • Else, when there are no partitions, how does the system trade off between:
    • Latency and
    • Consistency

For example, a system characterized as PA/EL means that in the face of partitions, it favors availability over consistency, and if everything’s working, it favors low latency over consistency.

I think this is moving very much in the right direction, and I hope I can contribute and help develop these ideas a bit.

Problems with the “Proof of the CAP theorem”

The “CAP” characterization has a lot of problems. It is especially poorly applied, if not actually misused, when someone trots out the “proof of the CAP theorem” to show how they were forced into a tradeoff. While the proof is correct, what is proves is too crude to model what we really care about.

I discussed the proof in an earlier post. In the proof, each attribute is problematic:

  1. “Consistency” means the behavior that would have happened on a single server that never crashed and did each operation in serial, which is fine, but lack of consistency means that the system makes no guarantees or representations about the result of an operation whatsoever.
  2. “Available” means that you get an answer “eventually”, but since eventually can mean any amount of time (a trillion years), there’s no practical difference between A and not A.
  3. “Partition-tolerance” is never actually defined in the paper.

“E” implies “A”

The reason the acronym doesn’t need to be “PACELCA” is that if there are no partitions, then the system must be available. Adding an “A” to the second part is redundant. But for me (maybe not for you), putting in the redundant “A” in the “E” case helps me. A PA/EL system is always “available”, and calling it PA/ELA makes it easier for me to see that availability is always there.

How do Availability and Latency relate?

Consider what “highly available” and “low latency” mean. They are not entirely distinct and orthogonal. The only useful meaning of “A” is that the system replies within a maximum latency. It could be something like “response within 10ms at least 90% of the time and within 100ms in any case” rather than a simple deadline. We can call this “fast enough” to meet the system requirements. So availability is about latency.

There is, however, an important practical difference. “Available” refers to a system’s latency related to the amount of time it takes to repair a partition.

To see this, consider two web sites (with human users) that are based on a system that can have partitions:

  • The operators of the system move so quickly that they always fix partitions within 10ms. The system is “available” even in the face of any single partition, without any special mechanism to be “partition tolerant”.
  • The operators of the system move so slowly that it takes them five minutes to fix a partition. If the system has no way to be “partition tolerant”, it’s not available.

Latency (the “L” in PACELC) has nothing to do with repair time, since it only applies when there are no partitions. A web site is far better with a (maximum/average/whatever) latency of 10ms than with 1000ms.

So “A” and “L” are different. But, that said, even if a system meets its “A” (fast enough) requirement, it can be valuable to lower the latency below that requirement. The “PAC” characterization does not take this into account.

PC/EL is confusing

If a system is consistent when there are partitions, then surely it’s also consistent when there aren’t any partitions. If the components work better, the service should not be worse.

At first glance, this seems to mean that “if PC, then EC”. That would mean that PC/EL can’t describe any realistic system, but Prof. Abadi characterizes PNUTS/Sherpa (as originally presented). I’m sure that there isn’t really a paradoxical situation with any real system, but rather that there is a way to misinterpret the PACELC notation. What do PC and EL really mean?

PC means that if a client sends a request when there are partitions that prevent the system from answering promptly and correctly, then the system does not answer, rather than providing an answer that might be incorrect. Indeed, it might not be able to reply at all, since a total failure is a kind of partition, and there just isn’t anybody to send back a reply.

EL means that if a client sends a request, and the system can choose between waiting a longer time to send a consistent answer, versus waiting a shorter time to send an inconsistent answer, it chooses (or tilts toward) the latter.

Loose ends

  • What does “C” really mean? Can’t we say something better than “we don’t guarantee consistency”? Dynamo can give you answers that are not definitive but are very useful, with semantics that the application can understand. What about “eventual consistency”?
  • What about durability? There’s a big difference between some data being temporarily offline versus data being lost forever. Some systems use “commits over a WAN” to replace the use of disks, and then the tradeoff of latency versus correctness, from synchronous to asynchronous commits, is important.
  • Should be distinguish between “available for read” vs. “available for write”? This can come up in, e.g., a master-slaves configuration.
  • Stay tuned.

    Errors in Database Systems Still Must Consider Network Partitions

    Tuesday, December 14th, 2010
    news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

    Prof. Michael Stonebraker wrote a paper, published in the April 2010 issue of CACM, entitled “Errors in Database Systems: Eventual Consistency and the CAP Theorem”. As I see it, the overall point of the paper is that the kinds of failures that cause partition-tolerance problems are rare, and not too significant compared to the other ways that a DBMS can fail. Therefore, people are worrying too much about the CAP issue, specifically network partitions.

    The paper enumerates eight causes of DBMS failure as seen by an application, such as software errors in the DBMS itself, operating system errors, and so on. Number six in his list is “a network partition in a local cluster”, and number eight is “a network failure in the WAN connecting clusters together; the WAN failed and clusters can no longer all communicate with each other”. (The usual reason one would have multiple clusters connected by a WAN is for “disaster recovery”, i.e. dealing with a problem that causes an entire cluster to fail, such as power loss over all hardware on which the cluster depends.)

    Number six is the crucial issue in the paper, as far as the CAP issue goes. He has two answers:

    1. In my experience, this is exceedingly rare, especially if one replicates the LAN (as Tandem did).
    2. The overwhelming majority [of local failures] cause a single node to fail, which is a degenerate case of a network partition that is easily survived by lots of algorithms.

    About #1, I have spent some time talking to the operations architects at ITA Software, which has run high-availability servers for many years now, and heard about their experience. It depends on what you mean by a “LAN”. If you mean a few computers connected together by an Ethernet, with redundant hardware all around, then the chance of a failure of the network itself is relatively low. However, real-world data centers with a relatively large number of servers rarely work this way. The problem is that a real network is very complicated. It depends on switches at both level 3 (routers) and level 2 (hubs). Situations can arise in which pieces of the network are mis-configured by accident; these can be hard to find due, ironically, the very redundancy that was added to avoid failures. In particular, there is no way to make any kind of guarantee about the latency within the network, nor the likelihood that a packet will make it from its source to its destination.

    About #2, it would have been helpful if there were citations in the paper. It is hard to reply to such a claim without specifics. One of the techniques to deal with one server being down involved “quorums”, but they can introduce problems with high-availability.

    But, more importantly, consider some of the failure modes that Amazon’s “Dynamo” highly-available key-value store is built to deal with. Suppose we have a key-value pair that resides on two servers, for high availability. Call them A1 and A2. An application changes the value of the pair, but does so at a time with A1 is down or unreachable. So the update is made to replica A2. Later, a second application reads the value associated with the key, but this time server A1 is down or unreachable, and server A2 is available. The second application might not see the new value written by the first application. I don’t know any of “lots of algorithms” that deals with this sort of scenario while providing complete consistency/correctness.

    The conclusion of the paper is: “In summary, one should not throw out the C so quickly, since there are real error scenarios where CAP does not apply and it seems like a bad tradeoff in many of the other situations.” But since network-partition failures really can happen, it’s not clear that one can simply decide not to throw out the consistency/correctness criterion.

    What Is a “Better” Database System?

    Sunday, November 7th, 2010
    news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

    What makes one database management system better than another?

    Let’s compare two database system, A and B, that use the same data model (such as “relational”) and the same transaction types (such as “ACID”, or ACID with some reduced isolation level. How do we decide whether A is better than B?

    These days, the answer is, which one has better latency or throughput for a given scenario. A scenario is defined by the contents of the database and the particular queries (any kind of request) it is given. If you read the marketing literature for any commercial database system, that’s what it talks about. The scenarios could include new datatypes, streaming, and so on. But the metric that measures “better” is still speed.

    But is this right for all of today’s needs?

    An exciting paper by Daniela Florescu and Donald Kossmann, entitled Rethinking Cost and Performance in Database Systems, suggests a different way of looking at things: given a set of performance requirements, and consistency needs, what is the least expensive database system we can build that meets those parameters. (You can also find the paper here.)

    For example, sometimes latency (response time) only has to be fast enough that a human being won’t be bothered by having to wait that much longer. There’s no point in trying to get the latency “as low as possible” when 100 milliseconds is just fine.

    The paper was published in SIGMOD Record, a journal sent to ACM members who are in SIGMOD (the special interest group for Management of Data). These are typically researchers, or people who follow the research closely. The paper suggests to the entire database research community that they should consider changing their whole orientation in this way. If you are working on building a database management system, or interested in looking at some of the new databases and data stores, this new viewpoint is illuminating. I recommend the paper highly.

    ACID in Theory and Practice

    Friday, November 5th, 2010
    news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

    The new so-called NoSQL data stores have been criticized, often by the traditional database community, because they sacrifice “ACID transactions”. Is this fair? How much does it matter? I’ll briefly go over what ACID transactions are and what they’re for, and then look at how they’re used, or not.

    ACID

    A “transaction” works this like: a thread (locus of control) does the following steps:

    • A “begin transaction” operation
    • Arbitrary computation, which can include:
      • examining data
      • modifying data
    • Either a “commit transaction” operation, or an “abort transaction” operation.

    If the “commit transaction” operation completes (i.e. returns to its caller in the thread), the transaction is said to have committed. If the thread does an “abort transaction”, or if the thread halts (the thread gets an unhanded exception, the thread is killed, the process is killed, the hardware crashes), the transaction is said to have “aborted”.

    (In some systems, the “begin transaction” is implicit when the previous transaction completes; it doesn’t matter.)

    Ideally, a transaction has four properties, usually described with the helpful mnemonic “ACID”:

    Atomic: If a transaction modifies the data and the transaction commits, all of the changes are performed; if the transaction aborts, none of them happens.

    Consistent: There is some predicate on the data, called “consistency”. If the data is in a consistent state when the transaction has first started (before it performs any side-effects), then it is consistent after the transaction finishes. (This is trivially true if the transaction aborts.)

    Isolated: Although many threads of control might be examining or modifying the data concurrently (interleaved in time), everything behaves as if they were sequential, i.e one at a time in some order.

    Durable: If the transaction commits, any modifications it has made are “durable”, which means that they take effect even if there is a halt.

    “Consistency” is hard to define. What is really means it that the data in the database is an accurate representation of the real world (for example, account X has $A and account B has $B), and that the transactions that moved the database state from a before-state to an after-state are consistent with real world operations (e.g. money has been withdrawn from account X and deposited in account Y).

    Unfortunately, there isn’t any real way to check and enforce this. So what happens depends a lot on the application. Often what people mean by “consistency” is that certain invariants are met. Some database systems provide support for adding checks for these invariants, called “integrity constraint”. Meeting these constraints is necessary but not, in general, sufficient for consistency, but that’s often what people mean by “consistency”. Mostly people don’t pay much attention to “C”, anyway.

    If a data storage system is both Atomic and Durable together, then modifications made by a committed transaction are all performed on the database, even in the face of a halting failure. This plus Isolation presents the application with an abstraction that’s very clean and easy to deal with.

    Most important, ACID is entirely independent of the application. The concerns of the application are entirely separated from the concerns of failure and interleaving. This separation of concerns makes things much simpler, and reducing complexity is of great value.

    Isolation in Theory and Practice

    Writing an application is easy with isolation, because the programmer can ignore concurrency. But do people really use database systems this way? When we look around, we find the concept of an “isolation level”, in which an application can decide how much isolation it wants. Don’t they all want total isolation? Yes, but there’s a big problem: total isolation hurts performance severely in so many cases that it’s rarely used! If you don’t believe me, consider the following.

    Thomas Kyte has written widely about Oracle DB, especially about how its transactions work. His book, “Expert Oracle Database Administration”, was recommended to me by a skilled Oracle database administrator; Kyle is highly respected. Although Oracle DB can do ACID transactions, the book strongly recommends against using them. Oracle DB has more than one “isolation level”. The strongest, READ REPEATABLE, provides ACID transactions. (Almost. If you care about the “phantom read” issue, you don’t need me to tell you about this stuff.) Instead, he recommends that you use the READ COMMITTED isolation level. He says that it is “the most commonly used isolation level” and that “it is rare to see a different isolation level used.”

    When using READ COMMITTED, you are not guaranteed to get “repeatable reads”. That is, during the course of a transaction, you might read a value, and later read the same value and get a different result back, because of writes by concurrent transactions! Remember that a read is often not a direct request to read just one column of one row; it’s often part of a more general SQL query. You might not even know that there is some data that two SQL queries both read. This is not what I’d call the “I” in “ACID”. The concurrency, rather than being cleanly separated from the application, is now exposed to the application. The application writer has to know that reads are not repeatable and take that into account, which makes his or her life harder.

    This isn’t specific to Oracle. In fact, it’s so pervasive in relational databases that it’s even part of the SQL standard. Isolation levels are so important that they aren’t just an implementation-specific hack. The official SQL standard defines several reduced levels of isolation.

    Here’s another story about not using ACID. I and the rest of the ObjectStore team at Object Design once had the great opportunity to talk with some of the most renowned database experts in the world, at IBM’s Alamaden Research Center. These are the people who designed one of the earliest relational database systems (System/R), and continued to do groundbreaking work, which can you can read in many excellent papers they have published, many of which I had read. The group included Don Haderly, C. Mohan, Bruce Lindsay, and others, If these people don’t know about transactions, nobody does.

    When they heard us say that ObjectStore provided real ACID transactions, they were surprised, and explained to us that nobody really uses those. They said you mustn’t do that, or your database system will be too slow.

    They said, what our relational database applications use is “cursor-stability isolation”. Here’s how that one works. In a relational database, you typically perform a query, and get back a sequence of rows (a.k.a. tuples). The application iterates over the tuples, with a “cursor” to keep track of where in the sequence it’s up to. With “cursor-stable” isolation, when the cursor moves to a row, that row is locked. When it moves the cursor to the next row, the old row is unlocked and the next row is locked. At the end, the last row is unlocked and the transaction ends.

    I was very surprised. While the application is working with one row, all the other rows could change out from under it. If you were trying to sum up some column (attribute) of each row, you might not get a consistent snapshot of the database.

    For example, suppose each row represented a bank account, and an application A wants to transfer $100 from account A to account B. Concurrently, application B wants to sum up the total amount in all accounts. B should get the same answer no matter how it is interleaved with A. B should not see an inconsistent state where the debit has been done and the credit has not. I asked how an application e deal with such confusing behavior. This is a lot like the Oracle situation: reads are not repeatable.

    I was very surprised: how can the application writers be expected to deal with this lack of isolation? The answer went something like this:

    Summing up a column was really done in one SQL transaction using a SUM aggregate, and in that case the problem does not arise, because within a single SQL query, you do get isolated behavior. (This is true in Oracle as well.) Many common simple cases can be handled using SQL aggregation operators.

    Yes, it’s true that if you have more than one query in your transaction, the application programmer does have to be aware of possible effects of interleaving. However, in real life (they said), most transactions are simple enough that it’s not so hard to reason about the effects of reduced isolation, and sometimes you can just ignore them.

    To me, this was not a very satisfying answer. It’s like saying, well, it works in simple cases and when you’re lucky.

    In ObjectStore, there were data structures much more complicated than tables. Indeed, ObjectStore could store anything that you can express in your programming languages (C++ or Java). We didn’t see any way to something analogous. We got away with using ACID because the sweet spot for ObjectStore wasn’t applications doing fine-grained interleaving.

    Who Casts the First Stone?

    The ACID transaction abstraction provides an excellent separation of concerns. It’s true that the NoSQL stores, with their “eventual consistency” properties, or their “return many possibly-different values” API’s, force the application to live with weaker guarantees than ACID. But so do the real relational database systems. Academic papers or commercial white papers that criticize the NoSQL data stores for not providing ACID should be fair: in the real world, nobody who cares about fine-grained concurrency is providing ACID guarantees.

    Addendum of Nov 8, 2010

    One of ITA’s very knowlegable Oracle experts pointed out some things that some issues that I should have discussed.

    I should have mentioned that using Oracle’s “read committed” isolation, you do get repeatable reads within a single SQL query. When writing software that uses relational databases, it’s good to do as much as you can within a single query, rather than doing many queries as part of an imperative flow of control. All other things being equal, declarative code is better than imperative code. It is much easier for a person to reason about, which makes code clearer and easier to understand. Also, it makes code easier for a computer to understand. Writing an optimizer for imperative code is harder than writing one for declarative code.

    Our expert tells me that sometimes programmers, who are generally trained in, and experienced with, imperative coding, will sometimes write programs that do one query after another, when it could have been done in a single query. To be sure, to do it in a single query can require you to learn morea bout SQL. But if you’re using a relational database that uses SQL, you really ought to learn that stuff. If you are using a tool, you should learn to use it.

    Of course, not all situations allow you to take a transaction and make it only need one SQL statement. But if you can do that, you get transaction guarantees that are much closer to ACID. (It’s still not precisely ACID due to the so-called “phantom” scenario, but I will cut Oracle slack for that since it’s hard to solve in their architecture.)

    However, I’ll add that one of the criticisms of the “NoSQL” data stores that the relational experts make is that they can only do one operation in a query. While that is true, and it is a disadvantage, it’s also true that if you use Oracle, your transaction has better properties if it only performs one operation (query) per transaction. That’s not an apples-to-apples comparison (traditional RDBMS’s are capable of doing multi-query transactions, but it’s something to think about.

    OpenSQL Boston 2010 Takes Place This Weekend

    Wednesday, October 13th, 2010
    news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

    OpenSQL Camp Boston happens this weekend. It’s an unConference, which means anybody can give a talk and anybody can listen. There arew usually several parallel tracks. This is an unConference about open source databases, both relational and non-relational databases, database alternatives like “NoSQL stores”, and so on. There will be people from PostgreSQL, MySQL, MariaDB, VoltDB, Rackspace, InfoBright, BerkeleyDB, MIT, and others.

    The events are:

    • Friday Oct 15, at 6pm: social event at WorkBar Boston, 711 Atlantic Ave, Boston, MA
    • Saturday, Oct 16: unConference at the Stata Center
    • Saturday, Oct 17: more unConference at the Stata Center, ending 6:00 p.m.

    Click here for the full info.