Mike Stonebraker is the co-founder and CTO of VoltDB, which makes a novel on-line transaction processing (OLTP) relational database management system (RDBMS). He recently gave a talk entitled “VoltDB Decapitates Six SQL Urban Myths”. You can read the slides here. Much of the talk is a reply to the claims of the community building data stores often referred to as NoSQL data stores.
Todd Hoff of HighScalability has written an excellent commentary on the talk. If you want to understand what’s going on with VoltDB, you can’t do better than to read this (including the commentary, with some replies from VoltDB). I have a bit to add.
Over many years, I have found that what nearly everybody wants is a predictive “single number” that says how much faster one DBMS is than another. But applications differ hugely in their workloads, and measured speed depends tremendously on using the DBMS in the best way, including layout, clustering, indexing, partitioning, and all kinds of options, such as whether transactions are immediately made durable or not. Saying that one DBMS is “N times faster” than another DBMS is very misleading. But everyone wants the magic number, and people are too quick to assume that the result of one benchmark predicts speed in all situations.
One must take into account that the VoltDB engineers wrote these micro-benchmarks, and ran them on a very specific workload, knowing what they were trying to prove. I do appreciate that they made a good-faith attempt to be fair, based on John Hugg’s comments above. And I can vouch that John is a very smart guy, and I believe all that he says in his comments above. Nevertheless, they did not bring in experts in the other systems who could tune them optimally. Different benchmarks might be less flattering.
The old argument about assembly language versus high-level languages would be analogous if RDBMS optimizers worked as well as C/Java/etc. compilers. SQL is supposed to be declarative: you just ask for what you want, and the RDBMS figures out the best way to get it. But my experience, and what my friends tell me, is that the optimizers in some popular RDBMSs (especially Oracle) frequently make bad choices, and picking the wrong query strategy can slow things down by huge factors. So the developers are forced to override the optimizer with “hints”. It’s been over 30 years, and still the optimizers fail. Maybe it’s time to declare the experiment a failure. (This may not be an issue for VoltDB, as the SQL might always be very simple or something.)
Stonebraker is right that performance can be hurt by too many round trips to the DBMS. But Oracle users have known for a long time that you have to use stored procedures to get high performance; this is nothing new. When you do this with Oracle, you end up with lots of PL/SQL code. Most of your developers can’t understand it, and it’s a proprietary language so you’re “locked in” to Oracle (it’s very hard to switch to a different DBMS).
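To make the round-trip point concrete, here is a toy sketch in Python. It is not Oracle or VoltDB code: `ToyServer`, `transfer_client_side`, and `call_transfer` are hypothetical names, and each method call on the server merely stands in for one network round trip. It only counts trips; it says nothing about real latency.

```python
class ToyServer:
    """In-memory stand-in for a DBMS that counts client/server round trips."""

    def __init__(self):
        self.balances = {"alice": 100, "bob": 50}
        self.round_trips = 0

    # Each public method below models exactly one network round trip.
    def read(self, account):
        self.round_trips += 1
        return self.balances[account]

    def write(self, account, amount):
        self.round_trips += 1
        self.balances[account] = amount

    def call_transfer(self, src, dst, amount):
        """Models a stored procedure: all of the logic runs on the server."""
        self.round_trips += 1
        if self.balances[src] < amount:
            return False
        self.balances[src] -= amount
        self.balances[dst] += amount
        return True


def transfer_client_side(server, src, dst, amount):
    """The same transfer with the logic on the client: one trip per statement."""
    src_balance = server.read(src)
    if src_balance < amount:
        return False
    dst_balance = server.read(dst)
    server.write(src, src_balance - amount)
    server.write(dst, dst_balance + amount)
    return True


chatty = ToyServer()
transfer_client_side(chatty, "alice", "bob", 30)

stored = ToyServer()
stored.call_transfer("alice", "bob", 30)

print(chatty.round_trips, stored.round_trips)  # prints "4 1"
```

Both versions leave the balances in the same state; the only difference the sketch shows is how many times the client has to talk to the server.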
It’s one thing to provide stored procedures as a way to improve performance. But VoltDB requires you to use stored procedures, and each interaction with VoltDB is a transaction. Any application that mixes database access with other operations that must be done on the client side cannot use VoltDB. The application has to be written in the VoltDB manner, from the beginning. This is like “lock-in” in some ways.
More about the VoltDB presentation
Todd says: “In contrast, the VoltDB model seems from a different age: a small number of highly tended nodes in a centralized location.” I don’t think this is right. For disaster recovery (e.g. blackouts), you need a replica far away; this has always been an integral part of VoltDB’s justification for not logging to disk. And then you have to worry about network partitions over a WAN. WANs are not yet supported in VoltDB.
I find Todd’s point about Amazon’s Dynamo very compelling: why would Amazon do so much work if partitions are so rare? At Amazon scale, partitions must be frequent enough to justify all this work. Not all VoltDB customers will be operating at that scale, but John Hugg has said that it’s designed for “Internet scale”. Dr. Stonebraker is right that there’s no substitute for actual measurement of how likely a partition is.
Putting the burden on application programmers
Serious production databases are usually managed by database experts/administrators, who decide where to replicate what, whether and how to partition tables (across nodes), and so on.
But with VoltDB, the application developers have to understand a lot about this. For example, they need to know whether a procedure is single-partitioned, so they can assert that in the code. So they have to know about sharding, where replicas are, and so on. It makes the application brittle insofar as changes by the database administrators could break those assertions.
For example, a VoltDB engineer explained to a customer: “The goal of VoltDB is to optimize single-partition transactions and part of the responsibility for that falls on the application developer. You must write the queries to operate properly within a single partition and then declare the procedure to be single-partitioned. [...] Today, VoltDB does not verify that the SQL queries within a single-partitioned procedure are actually single-partitioned.” Another VoltDB engineer said: “The vast majority (almost 100%) of your stored procedure invocations must be single-partition for VoltDB to be useful to you.”
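The risk of an unverified single-partition declaration can be sketched with a toy model in Python. This is not VoltDB’s API: `ToyCluster`, `partition_of`, `run_single_partition`, and `sum_balances` are hypothetical, the modulo routing rule is an assumption, and what a real VoltDB cluster does with a wrongly declared procedure may differ from the silent wrong answer shown here.

```python
NUM_PARTITIONS = 2


def partition_of(key):
    # Assumed routing rule for this sketch: integer keys, modulo partition count.
    return key % NUM_PARTITIONS


class ToyCluster:
    """Toy partitioned store; each partition is a plain dict."""

    def __init__(self):
        self.partitions = [dict() for _ in range(NUM_PARTITIONS)]

    def put(self, key, value):
        self.partitions[partition_of(key)][key] = value

    def run_single_partition(self, routing_key, procedure, *args):
        """Run `procedure` against only the partition that owns routing_key.

        Nothing here checks that the procedure's reads stay inside that
        partition; the caller's declaration is simply trusted.
        """
        local = self.partitions[partition_of(routing_key)]
        return procedure(local, *args)


def sum_balances(local, keys):
    # The developer has declared this single-partition. Keys that live in
    # another partition are simply not visible and are counted as 0.
    return sum(local.get(k, 0) for k in keys)


cluster = ToyCluster()
cluster.put(2, 100)  # partition 0
cluster.put(4, 50)   # partition 0
cluster.put(3, 25)   # partition 1

# Keys 2 and 4 share a partition, so the declaration holds.
print(cluster.run_single_partition(2, sum_balances, [2, 4]))  # prints 150

# Key 3 lives elsewhere; the procedure quietly misses its 25.
print(cluster.run_single_partition(2, sum_balances, [2, 3]))  # prints 100
```

In this toy, changing `NUM_PARTITIONS` or the routing rule changes which calls are correct, without touching the procedure itself. That is the brittleness at issue: the correctness of application code depends on a layout decision made elsewhere.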
Different “NoSQL” systems also put such burdens on application programmers, to greater or lesser degrees. RDBMSs have traditionally boasted that they hide these issues from application programmers. VoltDB uses SQL, but what it provides is very different from the original concept of the relational model.
What is a “SQL” database system?
You can see more of this in their “VoltDB do’s and don’ts list”. Perhaps the most important point is the first “Don’t”: “Don’t use ad hoc SQL queries as part of a production application.” Dr. Stonebraker’s talk is very much a defense of using SQL for OLTP, rather than the “NoSQL” models such as key-value stores. But what does the restriction against “ad hoc” queries mean?
The original fundamental claim of relational DBMSs (as opposed to the previous generation, the CODASYL-type DBMSs) is that you don’t have to know the access pattern; you just say what you want in SQL, and the DBMS figures out how to do it. Applications keep working even if there are changes in the storage layout, indexing, and whatever else the DBMS uses. But, as a VoltDB engineer said, “Part of VoltDB’s underlying premise is that workloads are known in advance.”
Even though VoltDB uses SQL, maybe it isn’t as far from the “NoSQL” storage engines as one might think!