Archive for June, 2011

Seed Funding and Angel Groups: The Fast and The Furious

Wednesday, June 29th, 2011

I have written a blog post called Seed Funding and Angel Groups: The Fast and The Furious, which was posted on Dharmesh Shah’s On Startups blog. It’s about the speed at which entrepreneurs can acquire seed financing, whether angel groups or venture capital partnerships can move faster, and how much all this matters.

I put it on Dharmesh’s blog at his request. If you have comments, and if it’s all the same to you, it’s probably better to put them on his blog rather than here, just to keep all comments in the same place.

Comments on “Urban Myths about NoSQL”

Friday, June 17th, 2011

Dr. Michael Stonebraker recently posted a presentation entitled “Urban Myths about NoSQL”. Its primary point is to defend SQL, i.e. relational, database systems against the claims of the new “NoSQL” data stores. Dr. Stonebraker is one of the original inventors of relational database technology, and has been one of the most eminent database researchers and practitioners for decades.

Many of the virtues of relational databases described in the presentation are specifically about a new and highly innovative RDBMS called VoltDB. VoltDB is made by a company called VoltDB.com, of which Dr. Stonebraker is co-founder and CTO. (There is also a good writeup about VoltDB here.)

The following are some comments about four of the six points in the presentation. I don’t consider any of these to “debunk” the presentation or anything like that, but they point out considerations that I feel should be taken into account.

#1: SQL is too slow

This argument assumes a perfect (or excellent) query optimizer. If you talk to anyone who has ever built a high-performance system on Oracle DB or DB2, you will find out about serious problems in query optimizers. I am not saying that rolling-your-own C code is the answer, but query strategies often have to be provided explicitly by the developer or DBA.

Stored procedures have a serious problem: you can’t interleave your own code with database operations. This can particularly be a problem if each stored procedure is its own transaction rather than an operation within a transaction, as in VoltDB. Existing large systems may not be able to operate within that constraint, although new systems designed with that in mind might not have any problem with this.

The “to go a lot faster” requires the whole database to be in main memory, as it is with VoltDB. (The points on the slides here do not apply to RDBMSs other than VoltDB.) The reason VoltDB can get rid of buffer management is that there are no (disk) buffers. VoltDB need not do lock management because there is no concurrency control: you just run every transaction to completion. There is no reason to interleave transactions, because there are no I/O waits.
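To make the "run every transaction to completion" idea concrete, here is a minimal sketch in Python. It is only an illustration of the general idea, not VoltDB's actual implementation; the names `SerialExecutor` and `transfer` are hypothetical and invented for this example. All data lives in an in-memory dictionary, and one loop runs the queued transactions one at a time, so no locks are needed.

```python
from collections import deque


class SerialExecutor:
    """Toy in-memory store that runs one transaction at a time (hypothetical)."""

    def __init__(self):
        self.data = {}        # the whole "database" lives in main memory
        self.queue = deque()  # transactions waiting to run

    def submit(self, txn, *args):
        # A transaction is just a function that receives the data plus arguments.
        self.queue.append((txn, args))

    def run_all(self):
        results = []
        while self.queue:
            txn, args = self.queue.popleft()
            # Each transaction runs to completion before the next one starts,
            # so there is nothing to lock and nothing to interleave.
            results.append(txn(self.data, *args))
        return results


def transfer(data, src, dst, amount):
    # Refuse the transfer if the source account cannot cover it.
    if data.get(src, 0) < amount:
        return False
    data[src] -= amount
    data[dst] = data.get(dst, 0) + amount
    return True


executor = SerialExecutor()
executor.data.update({"alice": 100, "bob": 0})
executor.submit(transfer, "alice", "bob", 70)
executor.submit(transfer, "alice", "bob", 70)
results = executor.run_all()
```

The second transfer sees the effect of the first one in full, because nothing else ran in between. The price is that a slow transaction holds up everything queued behind it, which is why this approach depends on there being no I/O waits.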

This is great if it works for your application. In point #5, he says that most OLTP databases are not very big, e.g. < 1TB, and for a database that size, using main memory is quite feasible these days. The size requirements of OLTP databases will probably rise with time. Of course, computers and memory are also getting faster and larger for the same price.

#3: SQL Systems don’t scale

If you have ever been involved in benchmarking, you know how difficult it is to interpret benchmark results. Is it possible that these results were obtained by choosing a benchmark that is particularly favorable to VoltDB? The only benchmark that really matters is your own application: they are all different. Of course, the problem with that is that it’s hard to port your application merely to test performance. But ignoring that and looking at other benchmarks is like looking for a lost key under the streetlight because it’s easier to look there. I’m not saying that these numbers are misleading, and certainly not that they are intentionally misleading, but they are very hard to interpret without knowing exactly what was benchmarked, how everything was tuned, and so on. I say this from my own experience, having done benchmarking of database systems for years.

(Also note that by TPC-C, he does not mean the officially defined TPC-C benchmark; look it up and you’ll see that running it is a major project. He means a very simplified example based on the key concepts in TPC-C, as you can see in the academic papers by him and others. That said, if you do want a micro-benchmark that is as close as possible to what people agree is a good measure of online transaction performance, this might be the best one can do.)

#5: ACID is too slow

ACID is great for software developers, providing them a very clean and easy-to-understand model. Ease of understanding is crucial for achieving simplicity, which is the Holy Grail of software development, enhancing maintainability and correctness. I’m all for ACID.

To clarify something often not explained well: the NoSQL stores are ACID. It’s just that what they can do within one ACID transaction is usually quite limited. For example, a transaction might only be able to fetch a value (or store a value, or increment a value) given the key, and then the transaction is over. That operation is ACID.
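As a rough illustration of what "one small operation is the whole transaction" means, here is a toy key-value store in Python. It does not model any particular NoSQL product; the class `TinyKeyValueStore` and its methods are hypothetical. Each call is atomic and isolated on its own, but there is no way to group two calls into a single transaction.

```python
import threading


class TinyKeyValueStore:
    """Toy store where every single-key operation is its own transaction."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def increment(self, key, delta=1):
        # The read-modify-write happens under one lock, so it is atomic.
        with self._lock:
            new_value = self._data.get(key, 0) + delta
            self._data[key] = new_value
            return new_value


store = TinyKeyValueStore()


def worker():
    for _ in range(1000):
        store.increment("hits")


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_hits = store.get("hits")
```

No increment is lost, because each `increment` is a complete transaction. What you cannot do with this interface is, say, debit one key and credit another as a single all-or-nothing unit.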

In a classic RDBMS, you can do many operations within one transaction. Your program says “begin transaction” (sometimes this is tacit), and then you can do computations that include both code and database queries/updates, interleaved. At the end you say “commit transaction”. (During or at the end of a transaction, the DBMS might have to abort the transaction.)
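Here is a small sketch of that pattern using Python's built-in `sqlite3` module, chosen only because it needs no setup. The `accounts` table and the `transfer` function are made up for the example. Note how ordinary application code (the balance check) runs between the query and the updates, all inside one transaction.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN / COMMIT / ROLLBACK statements below are the only
# transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")


def transfer(conn, src, dst, amount):
    conn.execute("BEGIN")  # "begin transaction"
    try:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        # Application code, interleaved with the database operations.
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute("COMMIT")  # "commit transaction"
        return True
    except ValueError:
        conn.execute("ROLLBACK")  # abort: none of the changes take effect
        return False


first = transfer(conn, "alice", "bob", 70)
second = transfer(conn, "alice", "bob", 70)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The first transfer commits; the second is rolled back by the application's own check, leaving the balances untouched.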

Right now, very few DBMSs provide true ACID properties in the way they are really used in practice, for two reasons. First, they run at reduced “isolation levels”, which means that the “I” in ACID is compromised. See my blog article for an explanation of this.

Second, one often wants to provide a way to recover from the failure of an entire data center. This is done by having a second data center that is far enough away that it won’t be damaged by the failure of the primary data center. This means you can keep going in the face of a “disaster” such as a regional power outage, a tsunami, etc.

The problem is that if the data center is far enough away to have truly independent failure modes, then the network connection will have latency so high that it is not feasible to do a synchronous commit that updates the distant copy for every transaction. Most often, commit results are sent asynchronously to the distant copy. If the local data center fails, any transactions that had been committed, but had not yet reached the distant copy, are lost. So these transactions were not durable, which is the “D” in ACID. So there is a tradeoff here. (People live with this by being willing to do manual fixups in the face of a disaster.)
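The durability gap can be shown with a tiny simulation. This is a toy model with invented names (`Primary`, `ship`), not a description of how any real DBMS replicates: the primary acknowledges each commit immediately and ships it to the distant copy later.

```python
from collections import deque


class Primary:
    """Toy primary data center with asynchronous replication (hypothetical)."""

    def __init__(self):
        self.log = []          # transactions committed locally
        self.outbox = deque()  # committed, but not yet sent to the replica

    def commit(self, txn):
        # The commit is acknowledged to the client right here,
        # before the distant copy has seen it.
        self.log.append(txn)
        self.outbox.append(txn)

    def ship(self, replica, count):
        # Send up to `count` pending transactions to the distant copy.
        for _ in range(min(count, len(self.outbox))):
            replica.append(self.outbox.popleft())


primary = Primary()
replica = []

for i in range(5):
    primary.commit(f"txn-{i}")

primary.ship(replica, 3)  # only part of the backlog gets across

# Suppose the primary data center fails now. Everything still in the
# outbox was reported as committed, yet exists nowhere else.
lost = [txn for txn in primary.log if txn not in replica]
```

Here `lost` ends up holding the last two transactions: the client was told they committed, but the surviving copy never received them.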

As discussed above, VoltDB transactions do not allow you to interleave code in your application with database operations inside a transaction. (The stored procedures can run arbitrary code, in Java, but that’s not the same as what I described above.)

#6: In CAP, choose AP over CA

I disagree that network partitions are not a major concern. Very simple local-area networks do not suffer from partitions and network failures much, but even a medium-size network is vulnerable, and networks in large data centers are quite vulnerable, as you can easily learn from network operations experts. For example, routers fail, or are misconfigured.

Both Amazon and Google have published papers about their large-scale data stores. The papers talk a lot about how they deal with network partitions. If partitions were so unlikely, why are these large companies taking the problem so seriously, and using rather sophisticated techniques to deal with the partitions? Also, the study of how to deal with network partitions has been a hot topic of research for the last 35 years; again, why would that be true if partitions were not an important concern?

So, as your network becomes larger and more complex, dealing with partitions becomes more and more of an issue. My impression (I may be wrong) is that the “sweet spot” for VoltDB, at least at the moment, is for distributed systems that are not at the kind of very-large scale of an Amazon or Google, and indeed for a much smaller scale, which makes network partitions much less of a problem. There’s nothing wrong with this at all; I’m just trying to clarify the issue and explain the reason for the controversy about this point.

Final Note

There has been an exciting explosion of innovative database technology in the last few years. Many different kinds of applications have different requirements. It’s great news for all of us that there are so many solutions at different points in the requirement space.