VoltDB versus NoSQL


Mike Stonebraker is the co-founder and CTO of VoltDB, which makes a novel on-line transaction processing (OLTP) relational database management system (RDBMS). He recently gave a talk entitled “VoltDB Decapitates Six SQL Urban Myths”. You can read the slides here. Much of the talk is a reply to the claims of the community building data stores often referred to as NoSQL data stores.

Todd Hoff of HighScalability has written an excellent commentary on the talk. If you want to understand what’s going on with VoltDB, you can’t do better than to read this (including the commentary, with some replies from VoltDB). I have a bit to add.

Benchmarking

Dr. Stonebraker’s talk includes benchmark results, in which VoltDB ran much faster than MySQL and Cassandra, a well-known NoSQL data store.

Over many years, I have found that what nearly everybody wants is a predictive “single number” that says how much faster one DBMS is than another. But applications differ hugely in their workloads, and measured speed depends tremendously on using the DBMS in the best way, including layout, clustering, indexing, partitioning, and all kinds of options, such as whether transactions are immediately made durable or not. Saying that one DBMS is “N times faster” than another DBMS is very misleading. But everyone wants the magic number, and is too quick to assume that the result of one benchmark predicts speed in all situations.

One must take into account that the VoltDB engineers wrote these micro-benchmarks, and ran them on a very specific workload, knowing what they were trying to prove. I do appreciate that they made a good-faith attempt to be fair, based on John Hugg’s comments above. And I can vouch that John is a very smart guy, and I believe all that he says in his comments above. Nevertheless, they did not bring in experts in the other systems who could tune them optimally. Different benchmarks might be less flattering.

The old argument about assembly language versus high-level languages would be analogous if RDBMS optimizers worked as well as C/Java/etc. compilers. SQL is supposed to be declarative: you just ask for what you want, and the RDBMS figures out the best way to get it. But my experience, and what my friends tell me, is that the optimizers in some popular RDBMS’s (especially Oracle) frequently make bad choices, and picking the wrong query strategy can slow things down by huge factors. So the developers are forced to override the optimizer with “hints”. It’s been over 30 years, and still the optimizers fail. Maybe it’s time to declare the experiment a failure. (This may not be an issue for VoltDB, as the SQL might always be very simple or something.)

Stored Procedures

He’s right that performance can be hurt by too many round trips to the DBMS. But Oracle users have known for a long time that you have to use stored procedures to get high performance; this is nothing new. When you do this with Oracle, you end up with lots of PL/SQL code. Most of your developers can’t understand it, and it’s a proprietary language so you’re “locked in” to Oracle (it’s very hard to switch to a different DBMS).
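To make the round-trip point concrete, here is a minimal sketch in Python. It is a toy model only: `ToyServer`, `transfer_from_client`, and `call_transfer` are hypothetical names invented for illustration, and nothing here is Oracle’s or VoltDB’s actual API. It just counts how many requests the client sends for the same logical transfer.

```python
class ToyServer:
    """Hypothetical stand-in for a database server that counts round trips."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.round_trips = 0

    # Statement-at-a-time interface: every call is one client/server round trip.
    def read_balance(self, account):
        self.round_trips += 1
        return self.balances[account]

    def write_balance(self, account, amount):
        self.round_trips += 1
        self.balances[account] = amount

    # Stored-procedure interface: the whole transfer runs server-side
    # and costs a single round trip.
    def call_transfer(self, src, dst, amount):
        self.round_trips += 1
        if self.balances[src] < amount:
            raise ValueError("insufficient funds")
        self.balances[src] -= amount
        self.balances[dst] += amount


def transfer_from_client(server, src, dst, amount):
    """Client-side logic: two reads and two writes, each a separate request."""
    src_balance = server.read_balance(src)
    dst_balance = server.read_balance(dst)
    if src_balance < amount:
        raise ValueError("insufficient funds")
    server.write_balance(src, src_balance - amount)
    server.write_balance(dst, dst_balance + amount)


chatty = ToyServer({"alice": 100, "bob": 20})
transfer_from_client(chatty, "alice", "bob", 30)

batched = ToyServer({"alice": 100, "bob": 20})
batched.call_transfer("alice", "bob", 30)

print(chatty.round_trips, batched.round_trips)  # 4 1
```

Both versions leave the balances in the same state; the only difference is how many times the network is crossed, which is the cost that stored procedures are meant to remove.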

It’s one thing to provide stored procedures as a way to improve performance. But VoltDB requires you to use stored procedures, and each interaction with VoltDB is a transaction. Any application that mixes database access with other operations that must be done on the client side cannot use VoltDB. The application has to be written in the VoltDB manner, from the beginning. This is like “lock-in” in some ways.

More about the VoltDB presentation

Todd says: “In contrast, the VoltDB model seems from a different age: a small number of highly tended nodes in a centralized location.” I don’t think this is right. For disaster recovery (e.g. blackouts), you need a replica far away; this has always been an integral part of VoltDB’s justification for not logging to disk. And then you have to worry about network partitions over a WAN. WAN’s are not yet supported in VoltDB.

I find Todd’s point about Amazon’s Dynamo very compelling: why would Amazon do so much work if partitions are so rare? At Amazon scale, partitions must be frequent enough to justify all this work. Not all VoltDB customers will be operating at that scale, but John Hugg has said that it’s designed for “Internet scale”. Dr. Stonebraker is right that there’s no substitute for actual measurement of how likely partition is.

Putting the burden on application programmers

Serious production databases are usually managed by database experts/administrators, who decide where to replicate what, whether and how to partition tables (across nodes), and so on.

But with VoltDB, the application developers have to understand a lot about this. For example, they need to know whether a procedure is single-partitioned, so they can assert that in the code. So they have to know about sharding, where replicas are, and so on. It makes the application brittle insofar as changes by the database administrators could break those assertions.

For example, a VoltDB engineer explained to a customer: “The goal of VoltDB is to optimize single-partition transactions and part of the responsibility for that falls on the application developer. You must write the queries to operate properly within a single partition and then declare the procedure to be single-partitioned. [...] Today, VoltDB does not verify that the SQL queries within a single-partitioned procedure are actually single-partitioned.” Another VoltDB engineer said: “The vast majority (almost 100%) of your stored procedure invocations must be single-partition for VoltDB to be useful to you.”
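The risk described in that quote can be illustrated with a toy model. The sketch below is not VoltDB code: `ToyCluster`, its methods, and the modulo partitioning scheme are all assumptions made up for illustration. It shows what can go wrong when a procedure is declared single-partitioned but its query is not actually restricted to the partitioning key, and nothing checks the declaration.

```python
NUM_PARTITIONS = 4  # hypothetical cluster size


def partition_for(customer_id):
    """Toy partitioning function: customer_id modulo the partition count."""
    return customer_id % NUM_PARTITIONS


class ToyCluster:
    def __init__(self):
        # One list of (customer_id, total) rows per partition.
        self.partitions = [[] for _ in range(NUM_PARTITIONS)]

    def insert_order(self, customer_id, total):
        self.partitions[partition_for(customer_id)].append((customer_id, total))

    def run_single_partition(self, procedure, routing_key, *args):
        """Trust the declaration: run only on the partition owning routing_key."""
        return procedure(self.partitions[partition_for(routing_key)], *args)

    def run_multi_partition(self, procedure, *args):
        """Run on every partition and add up the per-partition results."""
        return sum(procedure(rows, *args) for rows in self.partitions)


def count_orders_for_customer(rows, customer_id):
    # Genuinely single-partition: filters on the partitioning key.
    return sum(1 for cid, _ in rows if cid == customer_id)


def count_orders_over(rows, threshold):
    # NOT single-partition: matching rows may live on any partition.
    return sum(1 for _, total in rows if total > threshold)


cluster = ToyCluster()
for customer_id in range(20):
    cluster.insert_order(customer_id, 100 + customer_id)
cluster.insert_order(0, 500)  # a second order for customer 0

correct = cluster.run_single_partition(count_orders_for_customer, 0, 0)
wrong = cluster.run_single_partition(count_orders_over, 0, 50)
full = cluster.run_multi_partition(count_orders_over, 50)
print(correct, wrong, full)  # 2 6 21
```

In this model the mis-declared procedure silently returns a partial answer (6 instead of 21) rather than an error, which is why the correctness of the declaration rests on the application developer.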

Different “NoSQL” systems also put such burdens on application programmers to greater or lesser degrees, as well. RDBMS’s have traditionally boasted that they hide these issues from application programmers. VoltDB uses SQL, but what it provides is very different from the original concept of the relational model.

What is a “SQL” database system?

You can see more of this in their “VoltDB do’s and don’ts list”. Perhaps the most important point is the first “Don’t”: “Don’t use ad hoc SQL queries as part of a production application.” Dr. Stonebraker’s talk is very much a defense of using SQL for OLTP, rather than the “NoSQL” models such as key-value stores. But what does the restriction against “ad hoc” queries mean?

The original fundamental claim of relational DBMS’s (as opposed to the previous generation, the CODASYL-type DBMS’s) is that you don’t have to know the access pattern; you just say what you want in SQL, and the DBMS figures out how to do it. Applications keep working even if there are changes in the storage layout, indexing, and whatever else the DBMS uses. But, as a VoltDB engineer said, “Part of VoltDB’s underlying premise is that workloads are known in advance.”
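The relational claim is easy to see in a small experiment. The sketch below uses SQLite through Python’s standard `sqlite3` module purely as a convenient stand-in for a conventional RDBMS (it says nothing about VoltDB); the `orders` table and the index name are made up for the example. The application’s query text never changes, yet adding an index changes how the engine chooses to execute it.

```python
import sqlite3

# The application's query: it states what is wanted, not how to find it.
QUERY = "SELECT id, total FROM orders WHERE customer_id = ? ORDER BY id"


def run(conn, customer_id):
    return conn.execute(QUERY, (customer_id,)).fetchall()


def plan(conn, customer_id):
    """Return the engine's chosen strategy as one human-readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (customer_id,)).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 10, 100 + i) for i in range(1000)],
)

rows_before, plan_before = run(conn, 7), plan(conn, 7)

# A change an administrator might make; the application is not touched.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

rows_after, plan_after = run(conn, 7), plan(conn, 7)

print(plan_before)
print(plan_after)
print(rows_before == rows_after)  # True
```

The exact wording of the plan text varies between SQLite versions, but the second plan should mention `idx_orders_customer` while the first does not, and the rows returned are identical either way.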

Even though VoltDB uses SQL, maybe it isn’t as far from the “NoSQL” storage engines as one might think!

16 Responses to “VoltDB versus NoSQL”

  1. John Hugg Says:

    Thanks for the post Dan.

    I totally agree that VoltDB has a lot in common, both practically and philosophically, with the NoSQL movement. Like many of these systems, VoltDB sheds some of the niceties that people have grown used to with today’s general purpose RDBMSs like Postgres, Oracle, etc.. VoltDB also puts a bigger burden on the app developer, something we’re working hard to minimize, but won’t go away anytime soon. Like NoSQL systems, VoltDB offers a lot of scale in return.

    Where VoltDB differs from some NoSQL systems is focus. Simple answer: we’ve chosen C over P in CAP. We have transactions and relationships. They have partition tolerance. Like I said in the follow up post to the key-value benchmarking we did, nobody is going to choose a system as different as VoltDB or Cassandra based on some benchmark I did. Both systems are fast, but the set of tradeoffs are very different.

    We’ve talked to a lot of people who can’t give up the C in CAP, but still need scale. We built VoltDB for them. As you know personally, scaling legacy systems to internet scale takes a tremendous effort. While VoltDB (and NoSQL) might add additional work for the app developer, both share the goal of reducing the total effort (and cost) to deploy an application at scale.

    As all of these systems mature, the set of apps that make sense to run on them will increase. We’re working hard on new releases of VoltDB so we can be part of that fun. Stay tuned for 1.1 coming shortly.

  2. jdefarge Says:

    Hi Dan. It’s a great post, congrats! Well, I have some considerations to make on this post. :)

    First and foremost, you and Todd Hoff sort of pointed this out, but I would like to reinforce this statement one more time:

    ‘VoltDB is not a general purpose dbms system that solves all the problems in the DBMS field.’

    And as Todd pointed out: it will not kill Oracle-like DBMS nor NoSQL systems in the foreseeable future, but VoltDB definitely has a lot of potential. The same is true for some NoSQL systems.

    Unfortunately, the NoSQL hype usually makes people think RDBMS like MySQL or Oracle will just vanish into thin air. :) This attitude is childish and myopic, in my opinion.

    Regarding stored procedures, I guess that you put it lightly, as my 10+ years of experience show that it’s virtually impossible to switch to another DBMS once you’ve started to use PL/SQL SPs. I’ve seen systems that have more than 60% of their business logic as SPs in companies that use Oracle. But I think VoltDB’s approach doesn’t impose such a ‘hard’ lock-in as it uses a popular and ‘open’ language, Java, for its SPs.

    Additionally, one of Stonebraker’s papers claims that OLTP ‘interactive’ transactions do more harm than good, because:

    a) Your web app can do it without interactive transactions. Stonebraker claims that the last thing you would want is to hold a transaction for each person that is browsing a catalog on an e-commerce site. This is subject to warm discussion though. :)

    b) the use of interactive transactions imposes a hard burden on the complexity and load of SQL engines as they have to support concurrent data structures and face hard scalability issues.

    >> But with VoltDB, the application developers have to understand a lot
    >> about this.
    >> (…)
    >> It makes the application brittle insofar as changes by the
    >> database administrators could break those assertions.

    According to Stonebraker’s paper, their ultimate goal for next gen database systems is to build a self-tunable/no-knobs dbms system.

    This vision is utopian, but if you are having scalability problems (VoltDB’s target use case), then app developers have to understand a lot about SQL, database partitioning, and the inner workings of the DBMS engine to succeed. In addition, they will need to work side by side with operations and DBA folks to solve the issues. This will have to happen regardless of whether you use MySQL, VoltDB or a NoSQL system.

    In my opinion, VoltDB is not (yet) any better than a NoSQL system in this aspect, as the app developer needs to abandon their high level data management abstractions and get their hands ‘dirty’ by dabbling with partitions, distribution workload, and alike. The good news is that using VoltDB they are able to retain a sub-set of SQL, at least. ;)

    >> (…) just say what you want in SQL, and the DBMS figures out how to
    >> do it. Applications keep working even if there are changes in the
    >> storage layout, indexing, and whatever else the DBMS uses. But, as a
    >> VoltDB engineer said, “Part of VoltDB’s underlying premise is that
    >> workloads are known in advance.”

    After writing one or two VoltDB apps to play around, one thing clearly emerged: VoltDB definitely blurs the separation between the app and the database. I found using VoltDB quite like using HSQLDB, that is, from a developer’s point of view it looks almost like an embedded dbms. I’m still trying to figure out if this is good or bad at the end of the day, but if you are evaluating VoltDB you have to be aware of this. :)

    Note: there’s on-going research being conducted by MIT folks to create automatic database designers for VoltDB. These tools will take care of automatic partitioning and load distribution so that the developers will be freed from this burden. Nevertheless, I’ll not hold my breath waiting for this, because it’ll take a while until this research is ready for production.

    >> Even though VoltDB uses SQL, maybe it isn’t as far from the “NoSQL”
    >> storage engines as one might think!

    Couldn’t agree more! Ironically enough, VoltDB is closer to NoSQL systems than SQL systems. :)

    >> But what does the restriction against “ad hoc” queries mean?

    The answer is two fold:

    1) First, it’s a design trade-off: VoltDB distributes data and logic according to the workload and data access patterns, and it does so by analysing the schema and SPs. The “workloads are known in advance” is for doing this.

    If you start injecting interactive queries into the system then VoltDB would have to do such analysis on real-time and this would be sub-optimal, and quite hard to do;

    2) Second, SPs are great for debugging and testing. VoltDB allows you to use SPs for these tasks, but a high load app would be poorly benefitted from interactive queries nowadays.

    One of Stonebraker’s papers gives a nice historical perspective: the scientists who created SQL imagined a scenario where business people, like accountants and clerks, for example, would issue commands on a terminal to manage the data. Well, 30 years later this scenario is far different because there are many software layers between your end-user and your dbms, and the user relies on a series of clicks and form filling to support her interaction with the dbms.

    In such a scenario, is there *really* a need to use interactive sql commands? It’s a thought-provoking question. :)

  3. jdefarge Says:

    By the way, as VoltDB is for OLTP and OLTP only, it will need to pair up with other dbms systems if you like to provide ad-hoc query/OLAP/Reporting capabilities.

    Unlike Oracle and MySQL, that are one size fits all, the new age DBMS gurus claim that time has come to use multiple dbms on production.

  4. Alex Feinberg Says:

    Hi Dan,

    First, I really appreciate the excellent write up. One minor quibble: note that partitions don’t *just* happen in the context of network splits (“split brain” scenario, whether within a single datacenter or between multiple datacenters); failures and slowness of machines on the local network are partitions as well. Really, what happens is that in a failure scenario there’s a trade-off between consistency and availability; when the system is operating “as normal” you have both. Henry Robinson has a really great write up on this:

    http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/

    It’s also important to note that C in CAP is not the same as C in ACID: A and I (atomic, isolated transactions, which require consensus when done across entities in a distributed system) would be rough equivalents to C in CAP (thanks to Jeff Darcy for stating this elegantly). The FLP impossibility result states that it’s impossible to achieve consensus in an asynchronous system; timeouts and failure detectors can be used to make an asynchronous system behave synchronously, but then slow, failed or partitioned nodes can’t be told apart (a node might still receive an RPC request, process it but merely do it outside of the timeout window).

  5. Dan Weinreb Says:

    @jdefarge: I agree that doing “interactive” transactions, in the sense of waiting for users during a transaction, is out of the question. In fact, even sending messages over a network and waiting for a response during a transaction is no good (except if there’s a very short timeout). Transactions in an OLTP DBMS must be kept short.

    But it can be very useful to have a transaction that performs a mixture of computation and queries, where the computation must be part of the application rather than part of a stored procedure. That’s what I was referring to. To use VoltDB, you must forego that. It’s one of the tradeoffs.

    About lock-in: although VoltDB uses Java, it seems to me that the Java code would have to make use of classes/libraries that are specific to VoltDB in order to do anything useful. (I don’t actually know.)

    @Alex Feinberg: Yes, I totally agree. See my upcoming blog post, inspired by Jeff Darcy’s and with help from him.

  6. John Hugg Says:

    @Alex Feinberg: “consistency and availability; when the system is operating “as normal” you have both”

    This is true so long as you limit the scope of consistency to a single value, row, column family, etc… for many of the NoSQL systems anyway.

    VoltDB allows you to bundle read and write operations to multiple rows in multiple tables with atomicity and isolation. There’s no penalty for doing this within a single partition. It’s natively supported across partitions at reduced performance.

    But you’re right that a crucial difference is how these systems handle partitions. VoltDB is much more likely to lose availability. Eventually consistent systems will stay up, but may create diverging data that will have to be reconciled. If the data to be reconciled is a set of immutable things, like a user’s tweets, friend connections or shopping cart contents, then this reconciliation might be pretty trivial.

    If the data to be reconciled involves non-commutative operations, especially those with numbers, then it’s possible that two partitions could be reporting quite incorrect results during the partition. Depending on the application, it may also be very difficult to reconcile once the partition is over.

    Neither failure scenario is great, but for most apps, one scenario is much worse than the other. That’s why we believe there’s a place for both kinds of systems.

  7. jmart Says:

    re: round trip and stored procedures: one thing that is completely missed in this discussion is that the client-side jdbc impl can and should cache information thereby eliminating the extra round-trips that stonebraker suggests. only a brain-dead developer would create a jdbc impl that doesn’t batch rows between the client and the server. i’m sure all commercial vendors do this. it is disingenuous for stonebraker to suggest that jdbc/odbc is inherently slow because it is primarily iterator oriented (has batch interfaces too).

  8. Alex Feinberg Says:

    @John Hugg:

    “Neither failure scenario is great, but for most apps, one scenario is much worse than the other. That’s why we believe there’s a place for both kinds of systems.”

    In “violent agreement” on this.

  9. Dan Weinreb Says:

    @John Hugg: VoltDB cannot simply be “CA”, since that would mean it is “CA” only if there are never node crashes and the network never loses messages (see my next blog post, about what the CAP theorem means). VoltDB must also have some form of “weaker” protection against such failures. Replicas are one part of this. One must be careful about where to put replicas, taking into account which failures are independent of other failures (so that the hosts of both replicas don’t crash together due to a single failure). The Technical Overview says that it does replicas automatically, but surely it has to be told things about which boxes are on the same network switch and so on, for it to make wise choices by itself.

    VoltDB also guarantees more than the “C” of the paper. Their “C” is only linearizability. VoltDB provides actual transactions, which provide linearizability but much more.

    I have a question. VoltDB is based, at least to some degree, on H-Store. H-Store was a university prototype, of about two years ago. The H-Store paper that I read talked about three kinds of queries (“transaction classes”): single-sited, one-shot, and general. Is the one-shot concept in VoltDB? Does VoltDB do general transactions (even if they’re not recommended and don’t claim to be fast)?

    Thanks! VoltDB is so novel and cool! I hope everything is going great at VoltDB.

  10. John Hugg Says:

    @Dan Weinreb: We expect the first versions of VoltDB will be deployed on a single switch in a single rack (<40 nodes). When we roll out WAN support, we plan to support multiple single-switch-based clusters, each with a full copy of the data. We use TCP/IP networking, so we expect the primary failures to be individual nodes, or clusters as a whole. “Two-brain” network partitions, as Alex puts it, should be extremely rare within a rack. Two-brain scenarios are easier to run into over the WAN, but that’s true for most redundant systems today and there are ways of mitigating that pain.

    Over time we will, as you say, be more careful with partition placement, and support clusters that span racks with lots of machines. This is actually particularly interesting in some of the public clouds available, as they don’t give you all the information you might like yet.

    Still, we have yet to run into a use case that would need more than 20 nodes or so. We have no problem doing millions of multi-statement transactions per second on a cluster that size, and individual machines aren’t getting slower over time. One nice thing about being more than an order of magnitude faster is that you need many fewer machines.

    Re transaction types: VoltDB asks the developer to classify transactions as either single-partition or multi-partition. We do support fully general transactions with multiple round trips, but they are, of course, much slower. One thing we allow the developer to do is to tag a batch of SQL as the “final” batch for a transaction. This hint allows VoltDB to combine some cleanup work and can eliminate a round trip on the network. For a transaction with a single batch of SQL (one or more statements with no Java in between), this hint allows for the transaction to often complete in one round trip. These are the “one-shot” transactions we used to talk about.

    Thanks for the well wishes. Things are good at VoltDB. Congrats on the Google purchase.

  11. Dan Weinreb Says:

    @John Hugg: I just want to emphasize again that “P” in the Gilbert and Lynch paper refers not merely to “split-brain” scenarios, but actually to any node or network failure whatsoever. For VoltDB, which is a “CA” system, that means if you want to get “CA”, you have to show that the system remains “CA” in the face of network or node failures. (See my next blog post.) From a practical point of view, as Mike Stonebraker has pointed out, the network and nodes need only be as fail-proof as other failure modes that are inevitable (well, they should be better than that to some degree); they do not need to be perfect. But none of us needs a fancy proof to know that!

    Everything you say makes complete sense to me, and I’m happy to hear that the general transactions work properly. That must have required a lot of good engineering! I’m glad to hear things are going so well.

  12. John Hugg Says:

    @Dan Weinreb: Yes, I’ve read the Gilbert and Lynch paper and have followed much of the recent discussion around the web on the CAP theorem. I guess my position is that the CAP theorem is useful from a common sense perspective. I get how you can’t have all three, but the “pick two” mentality has never made all that much sense to me in terms of real systems.

    VoltDB is clearly interested in “C”, but it’s not cleanly “CP” or “CA”. It will stay available in the face of many partitions/failures, but there are some scenarios where it will stop service if it can’t guarantee “C”.

    That said, it’s our goal to make failures that stop service as avoidable as possible.

  13. Delicious Bookmarks for July 27th from 13:30 to 13:34 « Lâmôlabs Says:

    [...] Dan Weinreb’s blog » Blog Archive » VoltDB versus NoSQL – July 27th (tags: nosql database scalability voltdb comparison) [...]

  14. NoSQL Daily – Wed Sep 15 › PHP App Engine Says:

    [...] Dan Weinreb’s blog » Blog Archive » VoltDB versus NoSQL [...]

  15. ehcache.net Says:

    VoltDB versus NoSQL…

    Mike Stonebraker is the co-founder and CTO of VoltDB, which makes a novel on-line transaction processing (OLTP) relational database management system (RDBMS). He recently gave a talk entitled “VoltDB Decapitates Six SQL Urban Myths”. You can read the s…

  16. Tom G Says:

    Additional problem with ad-hoc – code injection. If it can be defined and executed, it’s a problem.

    Oracle has had this issue with PL/SQL since it was introduced.
