Archive for July, 2010

Air New Zealand’s outage last fall

Friday, July 30th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

I recently was reminded that  Air New Zealand’s airline reservation system went out of service on Sunday morning at about 9:30 am, October 11, 2009.  This story is very interesting to me, since my team is building just such a system (with a very different underlying implementation).

IBM said that the outage happened because of a power failure at an IBM data center in Newton that took out their mainframe.  Many existing airline reservation systems run on a single IBM mainframe; mainframes are known for being rock-solid reliable, but not without electricity! IBM said it was caused by a failed oil pressure sensor on a backup generator.  What’s more, the problem happened during a scheduled maintenance session!

The outage affected more than 10,000 passengers, leaving airports “in disarray”.  Most systems were restored around 1.30 pm [four hours later], but the passenger backlog did not start to clear until self check-in kiosks were up and running again about 3.30pm [six hours later]. Air New Zealand was, to put it mildly, furious.

As usual, people never (well, hardly ever) adequately test their redundant backup technology! In particular, they should have also used the generators for a long enough time to test for this kind of failure.  I recently heard about another such pr0blem, in a discussion at ITA, where the backup generators worked but didn’t work for long enough.  (At least I think it was a distinct case, since I heard it a while ago, but I could be wrong.)

You must do these tests reasonably frequently, since things can break over time, even if they are merely lying in wait. I plan to write more about this in a future blog post.

I don’t know, but I’ve been told: most, if not all, airlines do not actually have disaster recovery setup that would switch over to a geographically distant site.  Evidently airlines are surprisingly “penny wise and pound foolish” when it comes to redundant components, which they are loath to pay for.  (I think we were talking about network connections but it’s too long ago for me to remember clearly.  The same principle applies across the board.)

Afterward, apparently IBM’s main job was to grovel.  Air New Zealand, in the person of CEO Rob Fyfe, said in strong language that IBM took a long time to react, accept responsibility, and apologize.  He called IBM “amateur”, which is quite an insult for IBM, and that his IT team was looking for alternative suppliers already.  (I don’t know how that turned out.)

IBM did apologize by the evening of the next day, and said they “immediately engaged a team of 32 local IT professionals, supported by global colleagues. This means Mr. Fyfe considers two working days to be a very long time for such an apology.  Perhaps he was manly putting on a public show of anger, actually intended more for his customers and shareholders than for IBM.  But I don’t think that actually matters, from IBM’s point of view.  As someone participating in a team building such a system, that’s the point of view that I am most concerned with.

(By the way, back at MIT in the late 1970′s, when a guy from Digital Equipment showed up for “preventive maintenance” on one of our timesharing systems (removable disks on the MIT-MC KL/10), we called it “causitive maintenance”.  He once made a mistake that caused a lot of trouble.)

I don’t just mean this to rag on IBM.  Making systems that are working fully 24/7 is quite difficult and expensive.  Our team can perhaps learn a little bit from this: if nothing else, it is one more data point about the cost/benefit of system failure for an airline reservation system.  When airlines talk about high availability, they are not kidding!

Boston: No More Silicon Valley Envy!

Wednesday, July 28th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

A friend just sent me a link to a story by E. Douglas Banks entitled “Enough already: Get over the West Coast envy”. I agree with him: Bostonians feeling bad about “losing” to Silicon Valley have gone way overboard.  Banks is responding to an interview with Henry McCance of Greylock Partners, a venture capital firm that moved from the Boston area to Silicon Valley.  I want to amplify Banks’s comments.

Banks’s first two paragraphs are dead-on right.  I have heard that story over and over again, at many gatherings.  Usually, the comparison is between Silicon Valley and the Boston area.  The good part is, after talking about how Silicon Valley is doing so well compared to the Boston area, the speaker goes on to the positive part: a call to action, for all of us to work together to improve the Boston startup and tech ecosystem.  Henry McCance’s statement is largely lukewarm or worse about Boston.  And, as Banks point out, McCance pits the entire West coast, including the whole Seattle area, against Boston alone.  From now on, I’ll just say “here” to mean the Boston area, and “there” to mean Silicon Valley.

I don’t know, but I’ve been told: investors here are different from investor there.  But I’d use the word “careful” rather than “risk-averse”.  It’s not the same.  There is more and more seed funding around Boston.  Seed funding is very risky.  It’s just that you’re risking less money.  So startups can fail at the correct point, having not used up so much capital that providers of capital will be scared off.  If our VC’s and angel groups were reckless, they’d just all fail, and then where would we be?

Yes, there are a few very wealthy “super-angel” investors who evidently toss off cash with very little scrutiny of a startup, and that might also describe some wealthy high-tech founders.  It’s true that there are more there than here.  They have a larger “ecosystem”, and centers of high-tech are created by ecosystem’s positive feedback.  Being smaller isn’t all bad.  For example, we have a higher degree of interconnection.  But I would not claim that smaller is better.

McCance talks about Boston being poor at “spawning and sustaining great companies.” What, exactly, is this criterion?

He disparages the minicomputer companies, but look how long they lasted: Prime (20 years), Data General (31 years), Digital Equipment (41 years), and so on.  If these are supposed to be failures, how long does a company have to live to be a success?  50 years?  100 years?

Compare this with Google (12 years so far), Yahoo! (15 years but being overtaken), CNET (not the same kind of company), Facebook (6 years so far), MySpace (7 years so far and in decline, YouTube (5 years so far, no longer even exists as a company), LinkedIn (7 years so far) — you get the idea.  How sure can anyone be that they’ll last 50 or 100 years?  The industry changes even faster now than it used to.  All things must pass.

Banks mentioned Harmonix, Zipcar, iRobot, Carbonite, LogMeIn, Monster, Bose, and Staples.com.  In a comment, Scott Kirsner added A123 Systems, Akamai, Starent, ITA Software, EnerNOC, GlycoFi, Sirtris, and EqualLogic.

I’d immediately add two hugely successful companies: Cognex, the world leader in machine vision, going very strong for 29 years and MathWorks, going strong for 26 years.

Also: Ab Initio, Active Endpoints, Akiban, Apperian, Automation, CloudSwitch, Daily Grommet, Droid Works, Expressor, GateRocket, GenArts, Goby, Harvest Automation, Heartland Robotics, Hocoma, InterSystems, Kayak, Netezza, Nimbit, Progress Software, Rational (now part of IBM), Skyhook Wireless, Streambase, Timetrade Systems, Vela Systems, Vertica, VoltDB, and Xconomy.  This is just off the top of my head, so I apologize if I missed any of my friends’ companies!

There are plenty more, and you can find them in Xconomy and Boston Business Journal and Mass High Tech.

I’m an M.I.T. alumnus, and, as the Engineers Drinking Song says of the “small liberal arts college down the river”: “MIT was MIT when Harvard was a pup/And MIT will be MIT when Harvard’s time is up”.  But the Boston area get great engineers and entrepreneurs from Worcester Polytechnic, Rensselaer Polytechnic, Rochester Institute of Technology, Babson College, Bentley College, Boston University, and so on.

There are also many tech incubators such as TechStars, and several working spaces such as Betahouse and the Cambridge Innovation Center, where entrepreneurs can get office space and interact with other.

And everybody is working together to grow the ecosystem, help promote entrepreneurship, and establish even more “sustaining great companies”.  Silicon Valley, just you wait!

Sending Paper Cards

Sunday, July 25th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

It’s fun to buy presents for friends. But what if it’s a friend whose home is already too cluttered? Or what if they have as many cool mugs, plates, vases, etc. as I need.

A solution? Cards! That is, cards in the sense of birthday cards, only blank ones that you can use for anything, including just a letter. They always come with suitable envelopes.

Consider all the benefits:

  • Good ones are not easy to find, but not so hard to find that you just get frustrated, so shopping for them is fun.
  • Choosing which card to buy, and which card would be good for whom, is a fun way to exercise your aesthetic sense and contemplate what would make your friends happy.
  • They are gifts, so they don’t clutter up your own house.
  • They’re inherently ephemeral, so they needn’t clutter up the recipient’s house.
  • Nearly anyone can find a use for a nice card. Their own friends appreciate getting real cards; they mean a lot more than email.
  • Often you are supporting real artists, sometimes even artists you meet in person at craft fairs selling their cards, or even people you know personally.

You can find interesting and pretty cards at fairs such as The Cambridge River Festival, or at book stores, especially independent book stores (which, in the face of web commerce, need all the help they can get these days).

What Does the Proof of the “CAP theorem” Mean?

Monday, July 12th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Several years back, Eric Brewer of U.C. Berkeley presented the “CAP conjecture”, which he explained in these slides from his keynote speech at the PODC conference in 2004. The conjecture says that a system cannot be consistent, available, and partition-tolerant; that is, it can have two of these properties, but not all three. This idea has been very influential.

Seth Gilbert and Nancy Lynch, of MIT, in 2002, wrote a now-famous paper called “Brewer’s Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services”. It is widely said that this paper proves the conjecture, which is now considered a theorem. Gilbert and Lynch clearly proved something, but what does the proof mean by “consistency”, “availability”, and “partition-tolerance”?

Many people refer to the proof, but not all of them have actually read the paper, thinking that it’s all obvious. I wasn’t so sure, and wanted to get to the bottom of it. There’s something about my personality that drives me to look at things all the way down to the details before I feel I understand. (This is not always a good thing: I sometimes lose track of what I originally intended to do, as I “dive down a rat-hole”, wasting time.) For at least a year, I have wanted to really figure this out.

A week ago, I came across a blog entry called “Availability and Partition Tolerance” by Jeff Darcy. You can’t imagine how happy I was to find someone who agreed that there is confusion about the terms, and that they need to be clarified. Reading Jeff’s post inspired me to finally read Gilbert and Lynch’s paper carefully and write these comments.

I had an extensive email conversation with Jeff, without whose help I could not have written this. I am very grateful for his generous assistance. I also thank Seth Gilbert for helping to clarify his paper for me. I am solely responsible for all mistakes.

I will now explain it all for you. First I’ll lay out the basic concepts and terminology. Then I’ll discuss what “C”, “A”, and “P” mean, and the “CAP theorem”. Next I’ll discuss “weak consistency”, and summarize the meaning of the proof for practical purposes.

Basic Concepts

The paper has terminology and axioms that must be laid out before the proof can be presented.

A distributed system is built of “nodes” (computers), which can (attempt to) send messages to each other over a network. But the network is not entirely reliable. There is no bound on how long a message might take to arrive. This implies that a message might “get lost”, which is effectively the same as taking an extremely long time to arrive. If a node sends a message (and does not see an acknowledgment), it has no way to know whether the message was received and processed or not, because either the request or the response might have been lost.

There are “objects”, which are abstract resources that reside on nodes. Objects can perform “operations” on other objects. Operations are synchronous: some thread issues a request and expects a response. Operations do not request other operations, so they do not do any messaging themselves.

There can be replicas of an object on more than one node, but for the most part that doesn’t affect the following discussion. An operation could “read X and return the value”, “write X”, “add X to the beginning of a queue”, etc. I’ll just say “read” for an operation that has no side-effects and returns some part of the state of the object, and “write” to mean an operation that performs side-effects.

A “client” is a thread running on some node, which can “request” an object (on any node) to perform an operation. The request is sent in a message, and the sender expects a response message, which might returns a value, and which confirms that the operation was performed. In general, more than one thread could be performing operations on one object. That is, there can be concurrent requests.

The paper says: “In this note we will not consider stopping failures, though in some cases a stopping failure can be modeled as a node existing in its own unique component of a partition.” Of course in any real distributed system, nodes can crash. But for purposes of this paper, a crash is considered to be a network failure, because from the point of view of another node, there’s no way to distinguish between the two. A crashed node behaves exactly like a node that’s off the network.

You might say that if a node goes off the network and comes back, that’s not the same as a crash because the node loses its volatile state. However, this paper does not concern itself with a distinction between volatile and durable memory.  There’s no problem with that; issues of what is “in RAM” versus “on disk” are orthogonal to what this paper is about.

Consistent

The paper says that consistency “is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” They explain this more explicitly by saying that consistency is equivalent to requiring all operations (in the whole distributed system) to be “linearizable”.

“Linearizability” is a formal criterion presented in the paper “Linearizability: A Correctness Condition for Concurrent Objects”, by Maurice Herlihy and Jeannette Wing. It means (basically) that operations behave as if there were no concurrency.

The linearizability concept is based a model in which there is a set of threads, each of which can send an operation to an object, and later receive a response. Despite the fact that the operations from the different threads can overlap in time in various ways, the responses are as if each operation took place instantaneously, in some order. The order must be consistent with each thread’s own order, so that a read operation in a thread always sees the results of that thread’s own writes.

Linearizability per se does not include failure atomicity, which is the “A” (“atomic”) in “ACID”. But Gilbert and Lynch assume no node failures. So operations are atomic: they always run to completion, even if their response messages get lost.

So by “consistent” (“C”), the paper means that every object is linearizable. (That’s not what the “C” in “ACID” means, by the way, but that’s not important.) Very loosely, “consistent” means that if you get a response, it has the right answer, despite concurrency.

This is not what the “C” in “ACID transaction” means. It’s what the “I” means, namely “isolation” from concurrent operations. This is probably a source of confusion sometimes.

Furthermore, the paper says nothing about transactions, which have would have a beginning, a sequence of operations, and an end, which may commit or abort. “ACID” is talking about the entire transaction. The “linearizability” criterion only talks about individual operations on objects. (So the whole “ACID versus BASE” business, while cute, can be misleading.)

Available

“Available” is defined as “every request received by a non-failing node in the system must result in a response.” The phrase “non-failing node” seemed to imply that some nodes might be failing and others not. But since the paper postulates that nodes never fail, I believe the phrase is redundant, and can be ignored. After the definition, the paper says “That is, any algorithm used by the service must eventually terminate.”

The problem here is that “eventually” could mean a trillion years. This definition of “available” is only useful if it includes some kind of real-time limit: the response must arrive within a period of time, which I’ll call the maximum latency.

Next, it’s very important to notice that “A” says nothing about the content of the response. It could be anything, as far as “A” is concerned; it need not be “successful” or “correct”. (If think otherwise, see section 3.2.3.)

So “available” (“A”) means: If a client sends a request to a node, it always gets back some response within L time, but there is no guarantee about contents of the response.

Partition Tolerant

There is no definition, per se, of the term “partition-tolerant”, not even in section 2.3, “Partition Tolerance”.

First, what is a “partition”? They first define it to mean that there is a way to assort all the nodes into separate sets, which they call “components”, and all messages sent from a node in one component to another nodes in a separate component are lost. But then they go on to say “And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instance the message is lost.” or their formal purposes, “partition” simply means that a message can be lost. (The whole “component” business can be disregarded.)  That’s probably not what you had in mind!

In real life, some messages are lost and some aren’t, and it’s not exactly clear when a “partition” situation starts, is happening, or ends. I realize that for practical purposes, we usually know what a partition means, but if we’re going to do formal proofs and understand what was proved, one must be completely clear about these terms.

Even in a local-area network, packets can be dropped. Protocols like TCP re-transmit packets until the destination acknowledges that they have arrived. If that happens, it’s clearly not a network failure from the point of view of the application. “Losing messages” must have something to do with nodes entirely unable to communicate for a “long” time compared to the latency requirements of the system.

Furthermore, remember that node failure is treated as a network failure.

So “partition-tolerant” (“P”) means that any guarantee of consistency or availability is still guaranteed even if there is a partition. In other words, if a system is not partition-tolerant, that means that if the network can lose messages or any nodes can fail, then any guarantee of atomicity or consistency is voided.

CAP

The CAP theorem says that a distributed system as described above cannot have properties C, A, and P all at the same time. You can only have two of them. There are three cases:

AP: You are guaranteed get back responses promptly (even with network partitions), but you aren’t guaranteed anything about the value/contents of the response. (See section 3.2.3.) A system like this is entirely useless, since any answer can be wrong.

CP: You are guaranteed that any response you get (even with network partitions) has a consistent (linearizable) result. But you might not get any responses whatsoever. (See section 3.2.1.) This guarantee is also completely useless, since the entire system might always behave as if it were totally down.

CA: If the network never fails (and nodes never crash, as they postulated earlier), then, unsurprisingly, life is good. But if messages could be dropped, all guarantees are off. So a CA guarantee is only useful in a totally reliable system.

At first, this seems to mean that practical, large distributed systems (which aren’t entirely reliable) can’t make any useful guarantees! What’s going on here?

Weak Consistency

Large-scale distributed systems that must be highly available can provide some kind of “weaker” consistency guarantee than linearizability. Most such systems provide what they call “eventual consistency” and may return “stale data”.

For some applications, that’s OK. Google search is an obvious case: the search is already specified/known to be using “stale” data (data since the last time Google looked at the web page), so as long as partitions are fixed quickly relative to the speed of Google’s updating everything, (and even if sometimes not, for that matter), nobody is going to complain.

Just saying that results “might be stale” and will be “eventually consistent” is unfortunately vague. How stale can it be, and how long is “eventually”? If there’s no limit, then there’s no useful guarantee.

For a staleness-type weak consistency guarantee, you’d like to be able to say something like: “operations (that read) will always return a result that was consistent with all the other operations (that write) no longer ago than time X”. And this implies that “write” operations are never lost, i.e. always happen within a fixed time bound.

t-Connected Consistency

Gilbert and Lynch discuss “weakened consistency” in section 4. It’s also about stale data, but with “formal requirements on the quality of stale data returned”. They call it “t-Connected Consistency”.

It makes two assumptions. (a) Every node has a clock that can be used to do timeouts. The clocks don’t have to be synchronous with each other. (b) There’s some time period after which you can assume that an unanswered message must be lost. (c) Every node processes a received message within a given, known time.

The real definition of “t-Connected Consistency” is too formal for me to explain here (see section 4.4). It (basically) guarantees (1) when there is no partition, the system is fully consistent; (2) if a partition happens, requests can see stale data; and (3) and after the partition is fixed, there’s a time limit on how long it takes for consistency to return.

Are the assumptions OK in practice? Every real computer can do timeouts, so (a) is no problem. You can always ignore any responses to messages after the time period, so (b) is OK. It’s not obvious that every system will obey (c), but some will.

I have two reservations. First, if the network is so big that it’s never entirely working at any one time, what would guarantee (3) mean? Second, in the algorithm in section 4.4, in the second step (“write at node A”), it retries as long as necessary to get a response. But that could exceed L, violating the availability guarantee.

So it’s not clear how attractive t-Connected Consistency really is. It can be hard it is to come up with formal proofs of more complicated, weakened consistency guarantees. Most working software engineers don’t think much about formal proofs, but don’t underrate them. Sometimes they can help you identifying bugs that would otherwise be hard to track down, before they happen.

Jeff Darcy wrote a blog posting about “eventual consistency” about a half year ago, which I recommend. And there are other kinds of weak consistency guarantees, such as the one provided by Amazon’s Dynamo key-store, which worth examining.

Reliable Networks

Can’t you just make the network reliable, so that messages are never lost? (“Never” meaning that the probability of losing a message is as low as other failure mode that you’re not protecting against.)

Lots and lots of experience has shown that in a network with lots of routers and such, no matter how much redundancy you add, you will experience lost messages, and you will see partitions that last for a significant amount of time. I don’t have a citation to prove this, but, ask around and that’s what experienced operators of distributed systems will always tell you.

How many routers is “lots”? How reliable is it if you have no routers (layer 3 switches), only hubs (layer 2 switches)? What if you don’t even have hubs? I don’t have answers to all this. But if you’re going to build a distributed system that depends on a reliable network, you had better ask experienced people about these questions. If it involves thousands of nodes and/or is geographically distributed, you can be sure that the network will have failures.

And again, as far as the proof of the CAP theorem is concerned, node failure is treated as a network failure. Having a perfect network does you no good if machines can crash, so you’d also need each node to be highly-available in and of itself. That would cost a lot more than using “commercial off-the-shelf” computers.

The Bottom Line

My conclusion is that the proof of the CAP theorem means, in practice: if you want to build a distributed system that is (1) large enough that nodes can fail and the network can’t be guaranteed to never lose messages, and (2) you want to get a useful response to every request within a specified maximum latency, then the best you can guarantee about the meaning of the response is that it is guaranteed to have some kind of “weak consistency”, which you had better carefully define in such a way that it’s useful.

P.S.

After writing this but just before posting it, Alex Feinberg added a comment to my previous blog post with a link to this excellent post by Henry Robinson, which discusses many of the same issues and links to even more posts. If you want to read more about all this, take a look.

VoltDB versus NoSQL

Sunday, July 11th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Mike Stonebraker is the co-founder and CTO of VoltDB, which makes a novel on-line transaction processing (OLTP) relational database management system (RDBMS). He recently gave a talk entitled “VoltDB Decapitates Six SQL Urban Myths”. You can read the slides here. Much of the talk is a reply to the claims of the community building data stores often referred to as NoSQL data stores.

Todd Hoff of HighScalability has written an excellent commentary on the talk. If you want to understand what’s going on with VoltDB, you can’t do better than to read this (including the commentary, with some replies from VoltDB). I have a bit to add.

Benchmarking

Dr. Stonebraker’s talk includes benchmark results, which VoltDB ran much faster than MySQL and , a well-known NoSQL data store.

Over many years, I have found that what nearly everybody wants is a predictive “single number” that says how much faster one DBMS is than another. But applications differ hugely in their workloads, and measured speed depends tremendously on using the DBMS in the best way, including layout, clustering, indexing, partitioning, and all kinds of options, such as whether transactions are immediately made durable or not. Saying that one DBMS is “N times faster” than another DBMS is very misleading. But everyone wants the magic number, and are too quick to assume that the result of one benchmark predicts speed in all situations.

One must take into account that the VoltDB engineers wrote these micro-benchmarks, and ran them on a very specific workload, knowing what they were trying to prove. I do appreciate that they made a good-faith attempt to be fair, based on John Hugg’s comments above. And I can vouch that John is a very smart guy, and I believe all that he says in his comments above. Nevertheless, they did not bring in experts in the other systems who would could tune them optimally. Different benchmarks might be less flattering.

The old argument about assembly language versus high-level languages would be analogous if RDBMS optimizers worked as well as C/Java/etc compilers. SQL is supposed to be declarative: you just ask for what you want, and the RDBMS figures out the best way to get it. But my experience, and what my friends tell me, is that the optimizers in some popular RDBMS’s (especially Oracle) frequently make bad choices, and picking the wrong query strategy can slow things down by huge factors. So the developers are forced to override the optimizer with “hints”. It’s been over 30 years, and still the optimizers fail. Maybe it’s time to declare the experiment a failure. (This may not be an issue for VoltDB, as the SQL might be always be very simple or something.)

Stored Procedures

He’s right that performance can be hurt by too many round trips to the DBMS. But Oracle users have know for a long time that you have to use stored procedures to get high performance; this is nothing new. When you do this with Oracle, you end up with lots of PL/SQL code. Most of your developers can’t understand it, and it’s a proprietary language so you’re “locked in” to Oracle (it’s very hard to switch to a different DBMS).

It’s one thing to provide stored procedures as a way to improve performance. But VoltDB requires you to use stored procedures, and each interaction with VoltDB is a transaction. Any application that mixes database access with other operations that must be done on the client side cannot use VoltDB. The application has to be written in the VoltDB manner, from the beginning. This is like “lock-in” in some ways.

More about the VoltDB presentation

Todd says: “In contrast, the VoltDB model seems from a different age: a small number of highly tended nodes in a centralized location.” I don’t think this is right. For disaster recovery (e.g. blackouts), you need a replica far away; this has always been an integral part of VoltDB’s justification for not logging to disk. And then you have to worry about network partitions over a WAN. WAN’s are not yet supported in VoltDB.

I find Todd’s point about Amazon’s Dynamo very compelling: why would Amazon do so much work if partitions are so rare? At Amazon scale, partitions must be frequent enough to justify all this work. Not all VoltDB customers will be operating at that scale, but John Hugg has said that it’s designed for “Internet scale”. Dr. Stonebraker is right that there’s no substitute for actual measurement of how likely partition is.

Putting the burden on application programmers

Serious production databases are usually manged by database experts/administrators, who decide where to replicate what, whether and how to partition tables (across nodes), and so on.

But with VoltDB, the application developers have to understand a lot about this. For example, they need to know whether a procedure is single-partitioned, so they can assert that in the code. So they have to know about sharding, where replicas are, and so on. It makes the application brittle insofar as changes by the database administrators could break those assertions.

For example, a VoltDB engineer explained to a customer: “The goal of VoltDB is to optimize single-partition transactions and part of the responsibility for that falls on the application developer. You must write the queries to operate properly within a single partition and then declare the procedure to be single-partitioned. [...] Today, VoltDB does not verify that the SQL queries within a single-partitioned procedure are actually single-partitioned.” Another VoltDB engineer said: “The vast majority (almost 100%) of your stored procedure invocations must be single-partition for VoltDB to be useful to you.”

Different “NoSQL” systems also put such burdens on application programmers to greater or lesser degrees, as well. RDBMS’s have traditionally boasted that they hide these issues from application programmers. VoltDB uses SQL, but what it provides is very different from the original concept of the relational model.

What is a “SQL” database system?

You can see more of this in their “VoltDB do’s and don’ts list” Perhaps the most important point is the first “Don’t”: “Don’t use ad hoc SQL queries as part of a production application.” Dr. Stonebraker’s talk is very much a defense of using SQL for OLTP, rather than the “NoSQL” models such as key-value stores. But what does the restriction against “ad hoc” queries mean?

The original fundamental claim of relational DBMS’s (as opposed to the previous generation, the CODASYL-type DBMS’s) is that you don’t have know the access pattern; you just say what you want in SQL, and the DBMS figures out how to do it. Applications keep working even if there are changes in the storage layout, indexing, and whatever else the DBMS uses. But, as a VoltDB engineer said, “Part of VoltDB’s underlying premise is that workloads are known in advance.”

Even though VoltDB uses SQL, maybe it isn’t as far from the “NoSQL” storage engines as one might think!