Rebuttal to Stallman’s Story About The Formation of Symbolics and LMI

Richard Stallman has been telling a story about the origins of the Lisp machine companies, and the effects on the M.I.T. Artificial Intelligence Lab, for many years. He has published it in a book, and in a widely-referenced paper, which you can find at http://www.gnu.org/gnu/rms-lisp.html.

His account is highly biased, and in many places just plain wrong. Here’s my own perspective on what really happened.

Richard Greenblatt’s proposal for a Lisp machine company had two premises. First, there should be no outside investment. This would have been totally unrealistic: a company manufacturing computer hardware needs capital. Second, Greenblatt himself would be the CEO. The other members of the Lisp machine project were extremely dubious of Greenblatt’s ability to run a company. So Greenblatt and the others went their separate ways and set up two companies.

Stallman’s characterization of this as “backstabbing”, and his claim that Symbolics decided “not to have scruples”, is pure hogwash. There was no backstabbing whatsoever. Symbolics was extremely scrupulous. Stallman’s characterization of Symbolics as “looking for ways to destroy” LMI is pure fantasy.

Stallman claims that Symbolics “hired away all the hackers” and that “the AI lab was now helpless” and “nobody had envisioned that the AI lab’s hacker group would be wiped out, but it was” and that Symbolics “wiped out MIT”. First of all, had there been only one Lisp machine company as Stallman would have preferred, exactly the same people would have left the AI lab. Secondly, Symbolics only hired four full-time and one part-time person from the AI lab (see below).

Stallman goes on to say: “So Symbolics came up with a plan. They said to the lab, ‘We will continue making our changes to the system available for you to use, but you can’t put it into the MIT Lisp machine system. Instead, we’ll give you access to Symbolics’ Lisp machine system, and you can run it, but that’s all you can do.’” In other words, software that was developed at Symbolics was not given away for free to LMI. Is that so surprising? Anyway, that wasn’t Symbolics’s “plan”; it was part of the MIT licensing agreement, the very same one that LMI signed. LMI’s changes were all proprietary to LMI, too.

Next, he says: “After a while, I came to the conclusion that it would be best if I didn’t even look at their code. When they made a beta announcement that gave the release notes, I would see what the features were and then implement them. By the time they had a real release, I did too.” First of all, he really was looking at the Symbolics code; we caught him doing it several times. But secondly, even if he hadn’t, it’s a whole lot easier to copy what someone else has already designed than to design it yourself. What he copied were incremental improvements: a new editor command here, a new Lisp utility there. This was a very small fraction of the software development being done at Symbolics.

His characterization of this as “punishing” Symbolics is silly. What he did never made any difference to Symbolics. In real life, Symbolics was rarely competing with LMI for sales. LMI’s existence had very little to do with Symbolics’s bottom line.

And while I’m setting the record straight, the original (TECO-based) Emacs was created and designed by Guy L. Steele Jr. and David Moon. After they had it working, and it had become established as the standard text editor at the AI lab, Stallman took over its maintenance.

Here is the list of Symbolics founders. Note that Bruce Edwards and I had worked at the MIT AI Lab previously, but had already left to go to other jobs before Symbolics started. Henry Baker was not one of the “hackers” of which Stallman speaks.

  • Robert Adams (original CEO, California)
  • Russell Noftsker (CEO thereafter)
  • Minoru Tonai (CFO, California)
  • John Kulp (from MIT Plasma Physics Lab)
  • Tom Knight (from MIT AI Lab)
  • Jack Holloway (from MIT AI Lab)
  • David Moon (half-time, from MIT AI Lab)
  • Dan Weinreb (from Lawrence Livermore Labs)
  • Howard Cannon (from MIT AI Lab)
  • Mike McMahon (from MIT AI Lab)
  • Jim Kulp (from IIASA, Vienna)
  • Bruce Edwards (from IIASA, Vienna)
  • Bernie Greenberg (from Honeywell CISL)
  • Clark Baker (from MIT LCS)
  • Chris Terman (from MIT LCS)
  • John Blankenbaker (hardware engineer, California)
  • Bob Williams (hardware engineer, California)
  • Bob South (hardware engineer, California)
  • Henry Baker (from MIT)
  • Dave Dyer (from USC ISI)
http://danweinreb.org/blog/rebuttal-to-stallmans-story-about-the-formation-of-symbolics-and-lmi

Perst, An Embedded Object-Oriented Database Management System

I just learned of Perst, which is described as an open-source embedded object-oriented Java database (meaning database management system, of course) which claims ease of working with Java objects, “exceptional transparent persistence”, and suitability for aspect-oriented programming with tools such as AspectJ and Javassist. It’s available under a dual license, with the free non-commercial version under the GPL. (There is a .NET version aimed at C# as well, but I’ll stick to what I know; presumably the important aspects are closely analogous.)

I have some familiarity with PSE Pro for Java, from what is now the ObjectStore division of Progress. Its official name is now Progress ObjectStore PSE Pro. I’ll refer to it as PP4J, for brevity. I was one of its original implementors, but it was substantially enhanced after I left. I also have a bit of familiarity with Berkeley Database Java Edition, a more recently developed embedded DBMS, which I’ll refer to as BDBJE.

PP4J provides excellent transparency, in the sense that you “just program in Java and everything works.” It does this by using a class-file postprocessor. However, Perst claims to provide the same benefit without such a postprocessor. It also claims to be “very fast, as much as four times faster than alternative commercial Java OODBMS.” Of course, “as much as” usually means that they found one such micro-benchmark; still, it would be uninteresting had they not even claimed good performance. And it has ACID transactions with “very fast” recovery.

Those are very impressive claims. In the rest of this post, I’ll examine them.

Who Created Perst?

I always like to glance at the background of the company. In particular, I like to know who the lead technical people are and where they worked before. Unfortunately, the company’s management page only lists the CEO, COO, and Director of Marketing, which is rather unusual. They’re in Issaquah, WA; could the technical people be ex-Microsoft? It’s important to note that McObject’s main product line, called eXtremeDB(tm), is technically unrelated to Perst.

But I found a clue. The Java package names start with org.garret. It’s usually hard to retroactively change Java package names, so they’re often leftovers from an earlier naming scheme. By doing some searches on “garret”, I found Konstantin Knizhnik, a 36-year-old software engineer from Moscow with 16 or so years of experience, who has written and is distributing an object-oriented database system called “GOODS” (the “G” stands for “Generic”). His most recent release was March 2, 2007. He has a table that compares the features of GOODS with those of other systems, including Perst. At the bottom it says: “I have three products GigaBASE, PERST and DyBASE build on the same core.” He also has an essay named Overview of Embedded Object Oriented Databases for Java and C# which includes an extensive section on the architecture of Perst. This page also has some micro-benchmark comparisons including Perst, PP4J, BDBJE, and db4o, but not GOODS. Perst comes out looking very good.

He even has a table of different releases of several DBMS’s, including GOODS and Perst, saying what changes were made in each minor release! But at no point does he say that he was involved in creating the Perst technology.

He mentions the web site perst.org. There’s nothing relevant there now, but Brewster’s Wayback machine shows that there used to be, starting in October, 2004. It’s quite clearly the same Perst. And the “Back to my home page” link is to Knizhnik’s home page. Aha, the smoking gun! By December, 2005, the site mentioned the dual license, and directed you to McObject LLC for a non-GPL commercial license. In 2006, the site changed to the McObject web site. McObject has several other embedded database products and was founded in 2001. This strongly suggests that McObject bought Perst from Knizhnik in 2006.

I joined the Yahoo “oodbms” group, and there’s Knizhnik, who is apparently maintaining FastDB, GigaBASE, and GOODS. He also wrote Dybase, based on the same kernel as GigaBASE. He announced Perst Lite in October, 2006. Postings on this group are sparse, mainly consisting of announcements of new minor releases of those three DBMS’s.

The Tutorial

The documentation starts with a tutorial. Here are the high points, with my own comments in square brackets. My comparisons are mainly with PP4J, which likewise provides transparent Java objects. BDBJE works at a lower level of abstraction. [Update: BDBJE now has a transparent Java object feature, called DPL.] I don’t know enough about the low-level organization of BDBJE or of the current PP4J to make well-informed qualitative comparisons.

Perst claims to efficiently manage much more data than can fit in main memory. It has slightly different libraries for Java 1.1, 1.4, and 1.5, and J2ME. There is a base class called Persistent that you have to use for all persistent classes. [This is a drawback, due to Java's lack of multiple inheritance of implementation. PP4J does not have this restriction.] They explain a workaround in which you can copy some of the code of their Persistent.java class. [That sounds somewhat questionable from a modularity point of view, and doesn't help you for third-party libraries unless you want to mess with their sources.]

Files that hold databases can be stored compressed, encrypted, or as several physical files, in no file at all for in-memory use, and there’s an interface allowing you to add your own low level representation. Each database has a root object in the usual way. They use persistence-by-reachability [like PP4J]. There is a garbage collector for persistent objects. However, there is also explicit deletion; they correctly point out that this can lead to dangling pointers. [The fact that they have it at all suggests that the garbage collector is not always good enough.]
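
To make this concrete, here is a minimal sketch of what a Perst program looks like, based on my reading of the documentation. The org.garret.perst names (Persistent, Storage, StorageFactory, getRoot, setRoot, commit) are as I recall them and may differ slightly across releases, and the Account class is my own invented example, not something from the Perst tutorial.

    import org.garret.perst.*;

    // My own toy persistent class; Perst requires extending Persistent.
    class Account extends Persistent {
        String owner;
        long balance;
        Account() {}                       // no-argument constructor for Perst's loader
        Account(String owner, long balance) { this.owner = owner; this.balance = balance; }
    }

    public class PerstSketch {
        public static void main(String[] args) {
            Storage db = StorageFactory.getInstance().createStorage();
            db.open("accounts.dbs", 32 * 1024 * 1024);  // database file and page-pool size in bytes
            Account root = (Account) db.getRoot();      // each database has a single root object
            if (root == null) {                         // first run: create and register the root
                root = new Account("root", 0);
                db.setRoot(root);
            }
            db.commit();   // persistence by reachability: whatever is reachable from the root is saved
            db.close();
        }
    }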

There are six ways to model relationships between objects. To their credit, they have a “(!)” after the word “six”. You can use arrays, but they explain the drawbacks to this. The Relation class is like a persistent ArrayList. The Link class is like Relation but it’s embedded instead of being its own object [huh?]. The IPersistentList interface has implementing classes that store a collection as a B-tree, which is good for large collections but has high overhead for small ones. Similarly, there is IPersistentSet. And finally there is a collection that automatically mutates from Link to a B-tree as the size gets larger. [PP4J, I believe, offers equivalents of the array, the list, and the set, and the list and set do the automatic mutation.]

How can they do transparent loading of objects, i.e. following pointers? They give you two choices, which can be set as a mode for each object: load everything reachable from the object, or make the programmer explicitly call the “load” method. They claim that this is usually sufficient, since your Java data structures usually consist of clusters of Java objects that are reachable from one head object, with no references between such clusters.

They assume that you always want to read in the whole cluster when you touch the head object [often true, but not always]. Also, when you modify an object, you must explicitly call the “modify” method, unless this is one of Perst’s library classes, whose methods call “modify” on themselves when needed. They say “Failure to invoke the modify method can result in unpredictable application behavior.”
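
Concretely, the protocol looks roughly like the method below (continuing my invented PerstSketch example from above; treat the exact usage as my reading of the tutorial, not gospel).

    // Explicit persistence calls, per the tutorial: load() pulls the object's state
    // into memory, modify() marks it dirty. Forgetting modify() silently loses the update.
    static void deposit(Storage db, Account acct, long amount) {
        acct.load();              // make sure the object's fields are loaded
        acct.balance += amount;   // ordinary Java field mutation
        acct.modify();            // tell Perst the object has changed
        db.commit();              // commit the current transaction
    }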

[This is not what I would call "transparent"! PP4J is truly transparent, in that there is neither a "load" nor a "modify". PP4J always does these automatically. The Perst tutorial does not say what happens if you forget to call "load" when you were supposed to. Not all Java data follows their cluster model. PP4J depends for its transparency on the class postprocessor. As I recall, the postprocessor runs quickly enough that it doesn't seriously impact the total compile time. The only problem I had with it, as a user, was that it doesn't fit with the model assumed by the interactive development environments such as IntelliJ, requiring some inelegance in setting up your project.]

Perst has collections with indexes implemented as B-trees and allowing exact, range, and prefix searches. The programmer must explicitly insert and delete objects from indexes; if you make a change to some object that might affect any indexes, you have to remove the object from the index and re-insert it. [So you need to know in advance which things might ever be indexed, or pessimistically assume that they all are, and so remove and re-insert whenever the object is changed in any way. I am pretty sure that PP4J does this automatically.] You can index on fields directly, or you can create your own index values (since you’re inserting explicitly) that could be any function of the indexed object. [That's very useful, and I cannot remember whether PP4J provides this.] Keys can be compound (several fields). They provide R-tree indexes and KD-tree indexes, useful for 2-D operations such as finding a point within certain constraints. They also provide Patricia Tries, bit indexes, and more. [Wow, how do they fit all that into such a small footprint?]
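
Here is a sketch of that manual index maintenance, again using my invented Account class; Index, Key, and createIndex are the Perst names as I remember them, so take the exact signatures with a grain of salt.

    // Creating a unique index keyed by owner name (done once) might look like:
    //     Index<Account> byOwner = db.createIndex(String.class, true);
    //
    // Changing an indexed field then requires remove-then-reinsert:
    static void renameOwner(Storage db, Index<Account> byOwner, Account acct, String newOwner) {
        byOwner.remove(new Key(acct.owner), acct);   // take the object out under its old key
        acct.owner = newOwner;
        acct.modify();
        byOwner.put(new Key(acct.owner), acct);      // put it back under the new key
        db.commit();
    }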

Transaction recovery is done with a shadow-object mechanism and can only do one transaction at a time. (So ACID really means AD.) [Like PP4J, at least in its first version.] The interaction of transaction semantics with threads, always a sticky issue, can be handled in several ways, too extensive to go into here. [This looks very good.] Locks are multiple-reader single-writer. Locking is not transparent in basic Perst [Bad!], but there’s a package called “Continuous” which does provide transparency, although it’s not described in the tutorial. [So beginning users also have to remember to do explicit locking?] Two processes can access a single database by locking the whole database at the file system level; it works to have many readers.

There is a class called “Database” that provides semantics more like a conventional DBMS. It maintains extents of classes. [Note: that means instances of the class can never be deallocated.] It can create indexes automatically based on Java annotations, but you still must do manual updates when the indexed object changes. It uses table-level locking. It has a query language called JSQL, but it’s unlike SQL in that it returns objects rather than tuples, and does not support joins, nested selects, grouping, or aggregate functions. You can “prepare” (pre-parse) JSQL queries, to improve performance if you use them many times, just as with most relational DBMS’s. A JSQL query is like a SQL “where” clause, and it uses whatever existing indexes are appropriate.
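
For a feel of that layer, here is my guess at what using Database and JSQL looks like; I have not run this, and the method names (createTable, addRecord, select) are from the documentation as I recall it.

    // The Database wrapper maintains class extents; a JSQL query is essentially a
    // "where" clause over one class and returns objects rather than tuples.
    static void listRichAccounts(Storage storage) {
        Database db = new Database(storage);
        db.createTable(Account.class);
        db.addRecord(new Account("alice", 5000));
        for (Object o : db.select(Account.class, "balance > 1000")) {
            Account a = (Account) o;
            System.out.println(a.owner + ": " + a.balance);
        }
    }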

Schema evolution is automatic, and done per-object as the object is modified. It can’t handle renaming classes and fields, moving fields to a descendant or ancestor class, changing the class hierarchy, nor changing types of fields that aren’t convertible in the Java language. There’s an export/import facility that you’d use for those changes. You can physically compact the database files. Backup and restore are just like files [you have to back up the whole thing, not just changes, which is probably true of PP4J as well.] You can export to XML.

Perst supports automatic database replication. There’s one master, where all writes are performed, and any number of slaves, where reads can be performed. This lets you load-balance reads. It’s done at page granularity. You can specify whether it’s synchronous or asynchronous. You can add new slaves dynamically. For high-availability, it’s your responsibility to detect node failure and choose a new master node. [PP4J did not have this, the last time I looked.]

Recent Press Releases from McObject

Version 3.0 has new features. There is a full-text search, much smaller in footprint than Lucene and geared specifically to persistent classes. The .NET version supports LINQ. They list CA’s Wily Technology as a successful reference customer, for a real-time Java application.

Perst is used in Frost, which is a client for Freenet. “Frost is a newsgroup reader-like client application used to share encrypted messages, files and other information over Freenet without fear of censorship.” They switched from “the previous used SQL database” [it turns out to be something called McKoi] because its recovery was unreliable (leaving corrupt databases), Perst’s schema evolution is easier to use, the databases are smaller (because Perst can store strings in UTF-8 encoding), and because they could get better performance from Perst as the database size grew.

Perst has been verified as compatible with Google’s Android. They provide a benchmark program comparing the performance of Perst against Android’s bundled SQLite. It’s a simple program that makes objects with one int field and one String field, and an index on each field. It inserts, looks up, etc. [It would be easy to recode it for PP4J or for BDBJE.]

The Download

The basic jar file is 530KB. “continuous” (see above) is another 57KB.

There’s more documentation, which goes into great detail about the internal implementation of how objects are represented and how atomic commit works. [It's extremely similar to PP4J. (The original version, anyway; it might have changed since.)]

There are other features, for which I could not find documentation. For instance, each persistent class can have a “custom” allocator that you supply. You could use this to represent very large objects (BLOB/CLOB) by putting them in a separate file. In the database, you’d store the name of this file, and perhaps an offset or whatever. Also, there is an implementation of the Resource Description Framework (RDF, used by the Semantic Web to represent metadata).

There are lots of parameters that you can set from environment variables. I was not able to find documentation for these. The one that interests me most lets you control the behavior of the object cache. The default is a weak hash table with LRU behavior. Other possibilities are a weak table without the LRU, a strong hash table (if you don’t want to limit the object cache size), and a SoftHashTable which uses a Java “soft” hash table.

The code is clearly written except that it’s extremely short on comments.

Overall Evaluation

Perst is a lot like PP4J. To my mind, the most important difference is the degree of transparency. I greatly prefer PP4J’s approach of providing complete transparency, i.e. not requiring the use of methods such as load and modify. This has two advantages. First, your code is clearer and simpler if it isn’t interrupted by all those calls to load and modify. Second, without transparency, it’s far too easy to forget to call load or modify, which would cause a bug, in some cases a bug that’s hard to find. Another problem is that the reference documentation is clearly incomplete and needs work. The tutorial, though, is quite clear and professionally-written, and very honest about the tradeoffs, pros, and cons of the product design. Personally, if you want to win my respect, that’s how to do it!

However, it has a bunch of features and packages that PP4J doesn’t have (as far as I know).

I don’t know anything about the pricing of either product.

On the whole, for what it’s aiming for, Perst appears to be very good, and a real competitor in this space.

VoltDB versus NoSQL

Mike Stonebraker is the co-founder and CTO of VoltDB, which makes a novel on-line transaction processing (OLTP) relational database management system (RDBMS). He recently gave a talk entitled “VoltDB Decapitates Six SQL Urban Myths”. You can read the slides here. Much of the talk is a reply to the claims of the community building data stores often referred to as NoSQL data stores.

Todd Hoff of HighScalability has written an excellent commentary on the talk. If you want to understand what’s going on with VoltDB, you can’t do better than to read this (including the commentary, with some replies from VoltDB). I have a bit to add.

Benchmarking

Dr. Stonebraker’s talk includes benchmark results in which VoltDB ran much faster than MySQL and a well-known NoSQL data store.

Over many years, I have found that what nearly everybody wants is a predictive “single number” that says how much faster one DBMS is than another. But applications differ hugely in their workloads, and measured speed depends tremendously on using the DBMS in the best way, including layout, clustering, indexing, partitioning, and all kinds of options, such as whether transactions are immediately made durable or not. Saying that one DBMS is “N times faster” than another DBMS is very misleading. But everyone wants the magic number, and people are too quick to assume that the result of one benchmark predicts speed in all situations.

One must take into account that the VoltDB engineers wrote these micro-benchmarks, and ran them on a very specific workload, knowing what they were trying to prove. I do appreciate that they made a good-faith attempt to be fair, based on John Hugg’s comments above. And I can vouch that John is a very smart guy, and I believe all that he says in his comments above. Nevertheless, they did not bring in experts in the other systems who could tune them optimally. Different benchmarks might be less flattering.

The old argument about assembly language versus high-level languages would be analogous if RDBMS optimizers worked as well as C/Java/etc compilers. SQL is supposed to be declarative: you just ask for what you want, and the RDBMS figures out the best way to get it. But my experience, and what my friends tell me, is that the optimizers in some popular RDBMS’s (especially Oracle) frequently make bad choices, and picking the wrong query strategy can slow things down by huge factors. So the developers are forced to override the optimizer with “hints”. It’s been over 30 years, and still the optimizers fail. Maybe it’s time to declare the experiment a failure. (This may not be an issue for VoltDB, as the SQL might always be very simple or something.)

Stored Procedures

He’s right that performance can be hurt by too many round trips to the DBMS. But Oracle users have known for a long time that you have to use stored procedures to get high performance; this is nothing new. When you do this with Oracle, you end up with lots of PL/SQL code. Most of your developers can’t understand it, and it’s a proprietary language so you’re “locked in” to Oracle (it’s very hard to switch to a different DBMS).

It’s one thing to provide stored procedures as a way to improve performance. But VoltDB requires you to use stored procedures, and each interaction with VoltDB is a transaction. Any application that mixes database access with other operations that must be done on the client side cannot use VoltDB. The application has to be written in the VoltDB manner, from the beginning. This is like “lock-in” in some ways.
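
To give a feel for what “written in the VoltDB manner” means, here is a hedged sketch of a VoltDB-style Java stored procedure. The org.voltdb classes (VoltProcedure, SQLStmt, voltQueueSQL, voltExecuteSQL) are the real API as I understand it, but the table, columns, and procedure are invented for illustration.

    import org.voltdb.*;

    // The whole interaction -- queue SQL statements, execute them, post-process in
    // Java -- runs inside the server as a single transaction.
    public class Deposit extends VoltProcedure {
        public final SQLStmt update = new SQLStmt(
            "UPDATE ACCOUNTS SET BALANCE = BALANCE + ? WHERE ACCOUNT_ID = ?;");
        public final SQLStmt read = new SQLStmt(
            "SELECT BALANCE FROM ACCOUNTS WHERE ACCOUNT_ID = ?;");

        public VoltTable[] run(long accountId, long amount) {
            voltQueueSQL(update, amount, accountId);
            voltQueueSQL(read, accountId);
            return voltExecuteSQL(true);   // execute the batch; the transaction ends when run() returns
        }
    }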

More about the VoltDB presentation

Todd says: “In contrast, the VoltDB model seems from a different age: a small number of highly tended nodes in a centralized location.” I don’t think this is right. For disaster recovery (e.g. blackouts), you need a replica far away; this has always been an integral part of VoltDB’s justification for not logging to disk. And then you have to worry about network partitions over a WAN. WAN’s are not yet supported in VoltDB.

I find Todd’s point about Amazon’s Dynamo very compelling: why would Amazon do so much work if partitions are so rare? At Amazon scale, partitions must be frequent enough to justify all this work. Not all VoltDB customers will be operating at that scale, but John Hugg has said that it’s designed for “Internet scale”. Dr. Stonebraker is right that there’s no substitute for actual measurement of how likely partition is.

Putting the burden on application programmers

Serious production databases are usually managed by database experts/administrators, who decide where to replicate what, whether and how to partition tables (across nodes), and so on.

But with VoltDB, the application developers have to understand a lot about this. For example, they need to know whether a procedure is single-partitioned, so they can assert that in the code. So they have to know about sharding, where replicas are, and so on. It makes the application brittle insofar as changes by the database administrators could break those assertions.

For example, a VoltDB engineer explained to a customer: “The goal of VoltDB is to optimize single-partition transactions and part of the responsibility for that falls on the application developer. You must write the queries to operate properly within a single partition and then declare the procedure to be single-partitioned. [...] Today, VoltDB does not verify that the SQL queries within a single-partitioned procedure are actually single-partitioned.” Another VoltDB engineer said: “The vast majority (almost 100%) of your stored procedure invocations must be single-partition for VoltDB to be useful to you.”
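
Concretely, declaring a procedure single-partitioned looked roughly like this in the VoltDB releases of that era (the @ProcInfo form may have changed since; the schema names are again invented). The developer is asserting, unverified by VoltDB, that every statement touches only the partition selected by the first parameter.

    import org.voltdb.*;

    @ProcInfo(
        partitionInfo = "ACCOUNTS.ACCOUNT_ID: 0",   // partitioning column : parameter index
        singlePartition = true                      // the developer's unchecked assertion
    )
    public class GetBalance extends VoltProcedure {
        public final SQLStmt read = new SQLStmt(
            "SELECT BALANCE FROM ACCOUNTS WHERE ACCOUNT_ID = ?;");

        public VoltTable[] run(long accountId) {
            voltQueueSQL(read, accountId);
            return voltExecuteSQL(true);
        }
    }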

Different “NoSQL” systems also put such burdens on application programmers to greater or lesser degrees, as well. RDBMS’s have traditionally boasted that they hide these issues from application programmers. VoltDB uses SQL, but what it provides is very different from the original concept of the relational model.

What is a “SQL” database system?

You can see more of this in their “VoltDB do’s and don’ts list”. Perhaps the most important point is the first “Don’t”: “Don’t use ad hoc SQL queries as part of a production application.” Dr. Stonebraker’s talk is very much a defense of using SQL for OLTP, rather than the “NoSQL” models such as key-value stores. But what does the restriction against “ad hoc” queries mean?

The original fundamental claim of relational DBMS’s (as opposed to the previous generation, the CODASYL-type DBMS’s) is that you don’t have to know the access pattern; you just say what you want in SQL, and the DBMS figures out how to do it. Applications keep working even if there are changes in the storage layout, indexing, and whatever else the DBMS uses. But, as a VoltDB engineer said, “Part of VoltDB’s underlying premise is that workloads are known in advance.”

Even though VoltDB uses SQL, maybe it isn’t as far from the “NoSQL” storage engines as one might think!

The Failure of Lisp? A Reply To Brandon Werner

Brandon Werner has written an excellent, thought-provoking post on his blog entitled “The Rise Of Functional Programming: F#/Scala/Haskell and the failing of Lisp”.

He currently works for Microsoft, but I think it’s clear that this posting reflects his own opinions rather than being the corporate voice of Microsoft. He clearly has a lot of real experience with Common Lisp and knows whereof he speaks, and I take him very seriously. I started to write a comment for the blog, but it got too big for that. So, here’s my open reply to Brandon:

Hi. I am one of those 50 year old men. (Well, I’ll be 50 in January, and I don’t use IRC.) I was one of the designers of Common Lisp and one of the co-founders of Symbolics. I wrote the second Emacs ever, in Lisp, following along as RMS developed the first Emacs (in TECO), when I was a teenager.

I currently work at ITA Software, a.k.a “the 800-pound gorilla of Common Lisp”. If you use Orbitz and ask “how do I get from Boston to Chicago on 10/4/2000 at 2pm …”, we provide an excellent set of choices of the cheapest routes and fares for which seats are available. This program, known as QPX, is written in Common Lisp.

I am working on our new product, an airline reservation system. It’s an online transaction-processing system that must be up 99.99% of the time, maintaining maximum response time (e.g. on www.aircanada.com). It’s a very, very complicated system. The presentation layer is written in Java using conventional techniques. The business rule layer is written in Common Lisp; about 500,000 lines of code (plus another 100,000 or so of open source libraries). The database layer is Oracle RAC. We operate our own data centers, some here in Massachusetts and a disaster-recovery site in Canada (separate power grid).

There are currently a total of eleven implementations of Common Lisp being actively maintained; see my survey. I see you are clearly much more interested in the seven free ones than the four for-money ones. At ITA Software, QPX uses SBCL, and the airline reservation system uses Clozure Common Lisp (formerly known as OpenMCL). There are a bunch of reasons we use different dialects, including historical and business ones that don’t apply to anyone but us. They’re both great. SBCL takes longer to compile but generates better code, which is more important for QPX (totally compute-bound) than for the airline reservation system (TP). Personally, I’d recommend these two, but reasonable people differ.

I am distressed and sad to hear that the community is judgmental and unfriendly to newcomers, thorny and uninspiring. I have heard this same criticism from people other than you, and at this point I assume it must really be true. My own point of view is, of course, entirely different from that of newcomers, so it’s probably harder for me to see that this is going on. Indeed, to me it seems that people do get answers on comp.lang.lisp and LispForum, and the tone doesn’t seem so nasty to me, usually. Maybe I’m just not “getting it”.

I would truly appreciate if you could let me know more specifically what kind of incidents you have encountered. (Was this mainly on IRC?) Putting people off like that is, in my opinion, rude, unethical, and obviously very bad for Lisp.

Please send me side mail about ASDF being incompatible in the two implementations. It’s certainly not supposed to be, and I’m not getting this from your installation writeup. Thank you. I agree that installation could be made more friendly and beginner-oriented for all these implementations. It should be much more “Batteries Included”. There has been at least one serious attempt to deal with this: Google for “Lisp in a Box” (and Peter Seibel has plans to improve it even more). But your general point is still right; we need much more of what “Lisp in a Box” is doing. The fact that too little effort has been put into this is part of a greater issue about the attitude of the community; more below.

You say “.. Common Lisp showed its failure as a community by sitting out this enthusiasm …” and I agree completely. In my opinion, this is part of the same issue about the attitude of the community.

Here’s another one you didn’t mention. Look at www.python.org and www.ruby-lang.com. They immediately tell you what the language is and what’s good about it (and they’re attractive). Now look at www.lisp.org. This will be vastly improved shortly, but the point is that it is yet another reflection of the same attitude issue.

The problems are that there is hardly any organized “Lisp community” at all, and that few people have been putting serious effort into trying to publicize and promote Lisp, and get new people involved.

The only companies with a direct financial interest, as far as I know, are the four for-money Lisp vendors. Franz has made some serious efforts, but what I understand now is that they’ve had so much trouble making money from selling Lisp that they’re spending less effort on that, and more effort on developing new products (many built on Lisp underneath). The other companies are basically too small to be effective.

Open source communities, like those around Python and Ruby, have been very effective. Those of us who’d like to see Lisp promoted need to understand more about how new languages have been so successful at this, and then rouse people to undertake the work of making it all happen. Certainly part of that is to provide a friendly community who can help and encourage new users.

I hope we can address these issues and get some real work going at the next International Lisp Conference. It’s at MIT, Mar 22-25; I’m general chair. (There’s not much info at the web site yet, but it’s coming soon.)

Why Did Symbolics Fail?

In a comment on a previous blog entry, I was asked why Symbolics failed. The following is oversimplified but should be good enough. My old friends are very welcome to post comments with corrections or additions, and of course everyone is invited to post comments.

First, remember that at the time Symbolics started around 1980, serious computer users used timesharing systems. The very idea of a whole computer for one person was audacious, almost heretical. Every computer company (think Prime, Data General, DEC) did their own hardware and their own software suite. There were no PCs, no Macs, no workstations. At the MIT Artificial Intelligence Lab, fifteen researchers shared a computer with a .001 GHz CPU and .002 GB of main memory.

Symbolics sold to two kinds of customers, which I’ll call primary and secondary. The primary customers used Lisp machines as software development environments. The original target market was the MIT AI Lab itself, followed by similar institutions: universities, corporate research labs, and so on. The secondary customers used Lisp machines to run applications that had been written by some other party.

We had great success amongst primary customers. I think we could have found a lot more of them if our marketing had been better. For example, did you know that Symbolics had a world-class software development environment for Fortran, C, Ada, and other popular languages, with amazing semantics-understanding in the editor, a powerful debugger, the ability for the languages to call each other, and so on? We put a lot of work into those, but they were never publicized or advertised.

But we knew that the only way to really succeed was to develop the secondary market. ICAD made an advanced constraint-based computer-aided design system that ran only on Symbolics machines. Sadly, they were the only company that ever did. Why?

The world changed out from under us very quickly. The new “workstation” category of computer appeared: the Suns and Apollos and so on. New technology for implementing Lisp was invented that allowed good Lisp implementations to run on conventional hardware; not quite as good as ours, but good enough for most purposes. So the real value-added of our special Lisp architecture was suddenly diminished. A large body of useful Unix software came to exist and was portable amongst the Unix workstations: no longer did each vendor have to develop a whole software suite. And the workstation vendors got to piggyback on the ever-faster, ever-cheaper CPU’s being made by Intel and Motorola and IBM, with whom it was hard for Symbolics to keep up. We at Symbolics were slow to acknowledge this. We believed our own “dogma” even as it became less true. It was embedded in our corporate culture. If you disputed it, your co-workers felt that you “just didn’t get it” and weren’t a member of the clan, so to speak. This stifled objective analysis. (This is a very easy problem to fall into — don’t let it happen to you!)

The secondary market often had reasons that they needed to use workstation (and, later, PC) hardware. Often they needed to interact with other software that didn’t run under Symbolics. Or they wanted to share the cost of the hardware with other applications that didn’t run on Symbolics. Symbolics machines came to be seen as “special-purpose hardware” as compared to “general-purpose” Unix workstations (and later Windows PCs). They cost a lot, but could not be used for the wider and wider range of available Unix software. Very few vendors wanted to make a product that could only run on “special-purpose hardware”. (Thanks, ICAD; we love you!)

Also, a lot of Symbolics sales were based on the promise of rule-based expert systems, of which the early examples were written in Lisp. Rule-based expert systems are a fine thing, and are widely used today (but often not in Lisp). But they were tremendously over-hyped by certain academics and by their industry, resulting in a huge backlash around 1988. “Artificial Intelligence” fell out of favor; the “AI Winter” had arrived.

(Symbolics did launch its own effort to produce a Lisp for the PC, called CLOE, and also partnered with other Lisp companies, particularly Gold Hill, so that customers could develop on a Symbolics and deploy on a conventional machine. We were not totally stupid. The bottom line is that interest in Lisp just declined too much.)

Meanwhile, back at Symbolics, there were huge internal management conflicts, leading to the resignation of much of top management, who were replaced by the board of directors with new CEO’s who did not do a good job, and did not have the vision to see what was happening. Symbolics signed long-term leases on big new offices and a new factory, anticipating growth that did not come, and were unable to sublease the properties due to office-space gluts, which drained a great deal of money. There were rounds of layoffs. More and more of us realized what was going on, and that Symbolics was not reacting. Having created an object-oriented database system for Lisp called Statice, I left in 1988 with several co-workers to form Object Design, Inc., to make an object-oriented database system for the brand-new mainstream object-oriented language, C++. (The company was very successful and currently exists as the ObjectStore division of Progress Software (www.objectstore.com). I’m looking forward to the 20th-year reunion party next summer.)

Symbolics did try to deal with the situation, first by making Lisp machines that were plug-in boards that could be connected to conventional computers. One problem is that they kept betting on the wrong horses. The MacIvory was a Symbolics Ivory chip (yes, we made our own CPU chips) that plugged into the NuBus (oops, long-since gone) on a Macintosh (oops, not the leading platform). Later, they finally gave up on competing with the big chip makers, and made a plug-in board using a fast chip from a major manufacturer: the DEC Alpha architecture (oops, killed by HP/Compaq, should have used the Intel). By this time it was all too little, too late.

The person who commented on the previous blog entry referred to an MIT Master’s thesis by one Eve Philips (see http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/ai-business.pdf) called “If It Works, It’s Not AI: A Commercial Look at Artificial Intelligence Startups”. This is the first I’ve heard of it, but evidently she got help from Tom Knight, who is one of the other Symbolics co-founders and knows as much or more about Symbolics history than I. Let’s see what she says.

Hey, this looks great. Well worth reading! She definitely knows what she’s talking about, and it’s fun to read. It brings back a lot of old memories for me. If you ever want to start a company, you can learn a lot from reading “war stories” like the ones herein.

Here are some comments, as I read along. Much of the paper is about the AI software vendors, but their fate had a strong effect on Symbolics.

Oh, of course, the fact that DARPA cut funding in the late ’80s is very important. Many of the Symbolics primary-market customers had been ultimately funded by DARPA research grants.

Yes, there were some exciting successes with rule-based expert systems. Inference’s “Authorizer’s Assistant” for American Express, to help the people who talk to you on the phone to make sure you’re not using an AmEx card fraudulently, ran on Symbolics machines. I learn here that it was credited with a 45-67% internal rate of return on investment, which is very impressive.

The paper has an anachronism: “Few large software firms providing languages (namely Microsoft) provide any kind of Lisp support.” Microsoft’s dominance was years away when these events happened. For example, remember that the first viable Windows O/S, release 3.1, came out in 1990. But her overall point is valid.

She says “There was a large amount of hubris, not completely unwarranted, by the AI community that Lisp would change the way computer systems everywhere ran.” That is absolutely true. It’s not as wrong as it sounds: many ideas from Lisp have become mainstream, particularly managed (garbage-collected) storage, and Lisp gets some of the credit for the acceptance of object-oriented programming. I have no question that Lisp was a huge influence on Java, and thence on C#. Note that the Microsoft Common Language Runtime technology is currently under the direction of the awesome Patrick Dussud, who was the major Lisp wizard from the third MIT-Lisp-machine company, Texas Instruments.

But back then we really believed in Lisp. We felt only scorn for anyone trying to write an expert system in C; that was part of our corporate culture. We really did think Lisp would “change the world” analogously to the way “sixties-era” people thought the world could be changed by “peace, love, and joy”. Sorry, it’s not that easy.

Which reminds me, I cannot recommend highly enough the book “Patterns of Software: Tales from the Software Community” by Richard Gabriel (http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf) regarding the process by which technology moves from the lab to the market. Gabriel is one of the five main Common Lisp designers (along with Guy Steele, Scott Fahlman, David Moon, and myself), but the key points here go way beyond Lisp. This is the culmination of the series of papers by Gabriel starting with his original “Worse is Better”. Here the ideas are far more developed. His insights are unique and extremely persuasive.

OK, back to Eve Philips: at chapter 5 she describes “The AI Hardware Industry”, starting with the MIT Lisp machine. Does she get it right? Well, she says “14 AI lab hackers joined them”; see my previous post about this figure, but in context this is a very minor issue. The rest of the story is right on. (She even mentions the real-estate problems I pointed out above!) She amply demonstrates the weaknesses of Symbolics management and marketing, too. This is an excellent piece of work.

Symbolics was tremendously fun. We had a lot of success for a while, and went public. My colleagues were some of the most skilled and likable technical people you could ever hope to work with. I learned a lot from them. I wouldn’t have missed it for the world.

After I left, I thought I’d never see Lisp again. But now I find myself at ITA Software, where we’re writing a huge, complex transaction-processing system (a new airline reservation system, initially for Air Canada), whose core is in Common Lisp. We almost certainly have the largest team of Common Lisp programmers in the world. Our development environment is OK, but I really wish I had a Lisp machine again.

What Does the Proof of the “CAP theorem” Mean?

Several years back, Eric Brewer of U.C. Berkeley presented the “CAP conjecture”, which he explained in these slides from his keynote speech at the PODC conference in 2000. The conjecture says that a system cannot be consistent, available, and partition-tolerant; that is, it can have two of these properties, but not all three. This idea has been very influential.

Seth Gilbert and Nancy Lynch, of MIT, in 2002, wrote a now-famous paper called “Brewer’s Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services”. It is widely said that this paper proves the conjecture, which is now considered a theorem. Gilbert and Lynch clearly proved something, but what does the proof mean by “consistency”, “availability”, and “partition-tolerance”?

Many people refer to the proof, but not all of them have actually read the paper, thinking that it’s all obvious. I wasn’t so sure, and wanted to get to the bottom of it. There’s something about my personality that drives me to look at things all the way down to the details before I feel I understand. (This is not always a good thing: I sometimes lose track of what I originally intended to do, as I “dive down a rat-hole”, wasting time.) For at least a year, I have wanted to really figure this out.

A week ago, I came across a blog entry called “Availability and Partition Tolerance” by Jeff Darcy. You can’t imagine how happy I was to find someone who agreed that there is confusion about the terms, and that they need to be clarified. Reading Jeff’s post inspired me to finally read Gilbert and Lynch’s paper carefully and write these comments.

I had an extensive email conversation with Jeff, without whose help I could not have written this. I am very grateful for his generous assistance. I also thank Seth Gilbert for helping to clarify his paper for me. I am solely responsible for all mistakes.

I will now explain it all for you. First I’ll lay out the basic concepts and terminology. Then I’ll discuss what “C”, “A”, and “P” mean, and the “CAP theorem”. Next I’ll discuss “weak consistency”, and summarize the meaning of the proof for practical purposes.

Basic Concepts

The paper has terminology and axioms that must be laid out before the proof can be presented.

A distributed system is built of “nodes” (computers), which can (attempt to) send messages to each other over a network. But the network is not entirely reliable. There is no bound on how long a message might take to arrive. This implies that a message might “get lost”, which is effectively the same as taking an extremely long time to arrive. If a node sends a message (and does not see an acknowledgment), it has no way to know whether the message was received and processed or not, because either the request or the response might have been lost.

There are “objects”, which are abstract resources that reside on nodes. “Operations” are performed on objects. Operations are synchronous: some thread issues a request and expects a response. Operations do not request other operations, so they do not do any messaging themselves.

There can be replicas of an object on more than one node, but for the most part that doesn’t affect the following discussion. An operation could “read X and return the value”, “write X”, “add X to the beginning of a queue”, etc. I’ll just say “read” for an operation that has no side-effects and returns some part of the state of the object, and “write” to mean an operation that performs side-effects.

A “client” is a thread running on some node, which can “request” an object (on any node) to perform an operation. The request is sent in a message, and the sender expects a response message, which might return a value, and which confirms that the operation was performed. In general, more than one thread could be performing operations on one object. That is, there can be concurrent requests.

The paper says: “In this note we will not consider stopping failures, though in some cases a stopping failure can be modeled as a node existing in its own unique component of a partition.” Of course in any real distributed system, nodes can crash. But for purposes of this paper, a crash is considered to be a network failure, because from the point of view of another node, there’s no way to distinguish between the two. A crashed node behaves exactly like a node that’s off the network.

You might say that if a node goes off the network and comes back, that’s not the same as a crash because the node loses its volatile state. However, this paper does not concern itself with a distinction between volatile and durable memory.  There’s no problem with that; issues of what is “in RAM” versus “on disk” are orthogonal to what this paper is about.

Consistent

The paper says that consistency “is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” They explain this more explicitly by saying that consistency is equivalent to requiring all operations (in the whole distributed system) to be “linearizable”.

“Linearizability” is a formal criterion presented in the paper “Linearizability: A Correctness Condition for Concurrent Objects”, by Maurice Herlihy and Jeannette Wing. It means (basically) that operations behave as if there were no concurrency.

The linearizability concept is based on a model in which there is a set of threads, each of which can send an operation to an object, and later receive a response. Despite the fact that the operations from the different threads can overlap in time in various ways, the responses are as if each operation took place instantaneously, in some order. The order must be consistent with each thread’s own order, so that a read operation in a thread always sees the results of that thread’s own writes.
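
As a toy, single-process analogy (mine, not the paper’s): a register whose operations are serialized by a lock is trivially linearizable, because each read or write takes effect atomically at some instant between its invocation and its response.

    // A trivially linearizable register: the lock makes every operation appear to
    // happen instantaneously, in an order consistent with each thread's own order.
    class LinearizableRegister {
        private int value;
        public synchronized void write(int v) { value = v; }
        public synchronized int read()        { return value; }
    }

The hard part, of course, is preserving that behavior when the object is replicated across nodes that communicate over an unreliable network; that is what the rest of the paper is about.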

Linearizability per se does not include failure atomicity, which is the “A” (“atomic”) in “ACID”. But Gilbert and Lynch assume no node failures. So operations are atomic: they always run to completion, even if their response messages get lost.

So by “consistent” (“C”), the paper means that every object is linearizable. (That’s not what the “C” in “ACID” means, by the way, but that’s not important.) Very loosely, “consistent” means that if you get a response, it has the right answer, despite concurrency.

This is not what the “C” in “ACID transaction” means. It’s what the “I” means, namely “isolation” from concurrent operations. This is probably a source of confusion sometimes.

Furthermore, the paper says nothing about transactions, which would have a beginning, a sequence of operations, and an end, which may commit or abort. “ACID” is talking about the entire transaction. The “linearizability” criterion only talks about individual operations on objects. (So the whole “ACID versus BASE” business, while cute, can be misleading.)

Available

“Available” is defined as “every request received by a non-failing node in the system must result in a response.” The phrase “non-failing node” seemed to imply that some nodes might be failing and others not. But since the paper postulates that nodes never fail, I believe the phrase is redundant, and can be ignored. After the definition, the paper says “That is, any algorithm used by the service must eventually terminate.”

The problem here is that “eventually” could mean a trillion years. This definition of “available” is only useful if it includes some kind of real-time limit: the response must arrive within a period of time, which I’ll call the maximum latency, L.

Next, it’s very important to notice that “A” says nothing about the content of the response. It could be anything, as far as “A” is concerned; it need not be “successful” or “correct”. (If you think otherwise, see section 3.2.3.)

So “available” (“A”) means: if a client sends a request to a node, it always gets back some response within time L, but there is no guarantee about the contents of the response.
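
From the client’s side, that guarantee is just a deadline. Here is a small sketch (mine, not the paper’s) in which a request counts as answered only if some response arrives within the maximum latency; the remoteRead call is a stand-in for a real network request.

    import java.util.concurrent.*;

    class BoundedLatencyClient {
        private final ExecutorService pool = Executors.newCachedThreadPool();

        // Returns whatever response arrives within maxLatencyMillis; a TimeoutException
        // means the service was not "available" for this request, in the paper's sense.
        String readWithDeadline(long maxLatencyMillis) throws TimeoutException {
            Future<String> response = pool.submit(() -> remoteRead());
            try {
                return response.get(maxLatencyMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }

        private String remoteRead() { return "value"; }   // placeholder for a network round trip
    }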

Partition Tolerant

There is no definition, per se, of the term “partition-tolerant”, not even in section 2.3, “Partition Tolerance”.

First, what is a “partition”? They first define it to mean that there is a way to divide all the nodes into separate sets, which they call “components”, and all messages sent from a node in one component to a node in a separate component are lost. But then they go on to say: “And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instant the message is lost.” For their formal purposes, “partition” simply means that a message can be lost. (The whole “component” business can be disregarded.) That’s probably not what you had in mind!

In real life, some messages are lost and some aren’t, and it’s not exactly clear when a “partition” situation starts, is happening, or ends. I realize that for practical purposes, we usually know what a partition means, but if we’re going to do formal proofs and understand what was proved, one must be completely clear about these terms.

Even in a local-area network, packets can be dropped. Protocols like TCP re-transmit packets until the destination acknowledges that they have arrived. If that happens, it’s clearly not a network failure from the point of view of the application. “Losing messages” must have something to do with nodes entirely unable to communicate for a “long” time compared to the latency requirements of the system.

Furthermore, remember that node failure is treated as a network failure.

So “partition-tolerant” (“P”) means that any guarantee of consistency or availability is still guaranteed even if there is a partition. In other words, if a system is not partition-tolerant, that means that if the network can lose messages or any nodes can fail, then any guarantee of atomicity or consistency is voided.

CAP

The CAP theorem says that a distributed system as described above cannot have properties C, A, and P all at the same time. You can only have two of them. There are three cases:

AP: You are guaranteed to get back responses promptly (even with network partitions), but you aren’t guaranteed anything about the value/contents of the response. (See section 3.2.3.) A system like this is entirely useless, since any answer can be wrong.

CP: You are guaranteed that any response you get (even with network partitions) has a consistent (linearizable) result. But you might not get any responses whatsoever. (See section 3.2.1.) This guarantee is also completely useless, since the entire system might always behave as if it were totally down.

CA: If the network never fails (and nodes never crash, as they postulated earlier), then, unsurprisingly, life is good. But if messages could be dropped, all guarantees are off. So a CA guarantee is only useful in a totally reliable system.

At first, this seems to mean that practical, large distributed systems (which aren’t entirely reliable) can’t make any useful guarantees! What’s going on here?

Weak Consistency

Large-scale distributed systems that must be highly available can provide some kind of “weaker” consistency guarantee than linearizability. Most such systems provide what they call “eventual consistency” and may return “stale data”.

For some applications, that’s OK. Google search is an obvious case: the search is already specified/known to be using “stale” data (data from the last time Google looked at the web page), so as long as partitions are fixed quickly relative to the speed of Google’s updating everything (and even if sometimes they’re not, for that matter), nobody is going to complain.

Just saying that results “might be stale” and will be “eventually consistent” is unfortunately vague. How stale can it be, and how long is “eventually”? If there’s no limit, then there’s no useful guarantee.

For a staleness-type weak consistency guarantee, you’d like to be able to say something like: “operations (that read) will always return a result that was consistent with all the other operations (that write) no longer ago than time X”. And this implies that “write” operations are never lost, i.e. always happen within a fixed time bound.

t-Connected Consistency

Gilbert and Lynch discuss “weakened consistency” in section 4. It’s also about stale data, but with “formal requirements on the quality of stale data returned”. They call it “t-Connected Consistency”.

It makes three assumptions. (a) Every node has a clock that can be used to do timeouts. The clocks don’t have to be synchronous with each other. (b) There’s some time period after which you can assume that an unanswered message must be lost. (c) Every node processes a received message within a given, known time.

The real definition of “t-Connected Consistency” is too formal for me to explain here (see section 4.4). It (basically) guarantees that (1) when there is no partition, the system is fully consistent; (2) if a partition happens, requests can see stale data; and (3) after the partition is fixed, there’s a time limit on how long it takes for consistency to return.

Are the assumptions OK in practice? Every real computer can do timeouts, so (a) is no problem. You can always ignore any responses to messages after the time period, so (b) is OK. It’s not obvious that every system will obey (c), but some will.

I have two reservations. First, if the network is so big that it’s never entirely working at any one time, what would guarantee (3) mean? Second, in the algorithm in section 4.4, in the second step (“write at node A”), it retries as long as necessary to get a response. But that could exceed L, violating the availability guarantee.

So it’s not clear how attractive t-Connected Consistency really is. It can be hard to come up with formal proofs of more complicated, weakened consistency guarantees. Most working software engineers don’t think much about formal proofs, but don’t underrate them: sometimes they can help you identify bugs, before they happen, that would otherwise be hard to track down.

Jeff Darcy wrote a blog posting about “eventual consistency” about half a year ago, which I recommend. And there are other kinds of weak consistency guarantees, such as the one provided by Amazon’s Dynamo key-value store, which is worth examining.

Reliable Networks

Can’t you just make the network reliable, so that messages are never lost? (“Never” meaning that the probability of losing a message is as low as that of the other failure modes you’re not protecting against.)

Lots and lots of experience has shown that in a network with lots of routers and such, no matter how much redundancy you add, you will experience lost messages, and you will see partitions that last for a significant amount of time. I don’t have a citation to prove this, but ask around: that’s what experienced operators of distributed systems will always tell you.

How many routers is “lots”? How reliable is it if you have no routers (layer 3 switches), only hubs (layer 2 switches)? What if you don’t even have hubs? I don’t have answers to all this. But if you’re going to build a distributed system that depends on a reliable network, you had better ask experienced people about these questions. If it involves thousands of nodes and/or is geographically distributed, you can be sure that the network will have failures.

And again, as far as the proof of the CAP theorem is concerned, node failure is treated as a network failure. Having a perfect network does you no good if machines can crash, so you’d also need each node to be highly-available in and of itself. That would cost a lot more than using “commercial off-the-shelf” computers.

The Bottom Line

My conclusion is that, in practice, the proof of the CAP theorem means: if you want to build a distributed system that is (1) large enough that nodes can fail and the network cannot be guaranteed never to lose messages, and (2) required to return a useful response to every request within a specified maximum latency, then the best you can promise about the meaning of a response is some kind of “weak consistency”, which you had better define carefully enough that it is actually useful.

P.S.

After writing this but just before posting it, Alex Feinberg added a comment to my previous blog post with a link to this excellent post by Henry Robinson, which discusses many of the same issues and links to even more posts. If you want to read more about all this, take a look.

Learning French

I’m going to France for the next two weeks and thought it would be good to try to learn some of the language. When I was a kid, I had three years of French instruction in school, but it was by far my worst subject.

I got some French language instruction CD’s, which I am listening to in my car as I commute. They’re called “Michel Thomas Method French for Beginners”. I was afraid that these would be just as boring as my school classes were. In fact, they’re great fun to listen to and practice with.

The format is that there’s a French man teaching two students. After teaching a few words and a bit of grammar, he gives them a sentence to say, and I can try to formulate it before they do (or pause the CD), and then I get immediate feedback.

When I was a kid learning French in school, we had to recite conjugations (“er”, “ir”, and “re” verbs, in many tenses), which was totally boring. Well, it turns out that you can say a whole lot of useful stuff without any of that. You can just use infinitives for all kinds of things (“je voudrais manger”). So you only need to learn the present-tense conjugations of a few verbs and you can say all kinds of useful things.

And you don’t need to use the future tense at all, since, just as in English you can say “I am going to eat”, in French you can say literally the same thing, “je vais manger”. Cool.

Now, whether I’ll be able to construct sentences on the fly while someone is standing there waiting for me and evaluating whether I’m right, is another question. I can easily imagine myself being embarrassed; maybe under pressure I’ll forget it all.

Also, whether I can understand what’s spoken to me is yet another issue.

I downloaded some French-English apps for the Android phone, but so far haven’t found any free apps better than Google’s own Translate. I’d rather have one that doesn’t depend on network access, though. I’m still looking.

Anyway, doing this is fun, which is perhaps what really matters. So if you had the same negative language-learning experience as I did, try out doing it this way.

Seed Funding and Angel Groups: The Fast and The Furious

I have written a blog post called Seed Funding and Angel Groups: The Fast and The Furious, which was posted on Dharmesh Shah’s On Startups blog. It’s about the speed at which entrepreneurs can acquire seed financing, whether angel groups or venture capital partnerships can move faster, and how much all this matters.

I put it on Dharmesh’s blog at his request. If you have comments, and if it’s all the same to you, it’s probably better to put them on his blog rather than here, just to keep all comments in the same place.

Comments on “Urban Myths about NoSQL”

Dr. Michael Stonebraker recently posted a presentation entitled “Urban Myths about NoSQL”. Its primary point is to defend SQL, i.e. relational, database systems against the claims of the new “NoSQL” data stores. Dr. Stonebraker is one of the original inventors of relational database technology, and has been one of the most eminent database researchers and practitioners for decades.

Many of the virtues of relational databases described here are specifically about a new and highly innovative RDBMS called VoltDB. VoltDB is made by a company called VoltDB.com, of which Dr. Stonebraker is co-founder and CTO. (There is also a good writeup about VoltDB here.)

The following are some comments about four of the six points in the presentation. I don’t consider any of these to “debunk” the presentation or anything like that, but they point out considerations that I feel should be taken into account.

#1: SQL is too slow

This argument assumes a perfect (or excellent) query optimizer. If you talk to anyone who has ever done a high-performance system on Oracle or DB2, you will find out about serious problems in query optimizers. I am not saying that rolling your own C code is the answer, but query strategies often have to be provided explicitly by the developer or DBA.

Stored procedures have a serious problem: you can’t interleave your own code with database operations. This is particularly a problem if each stored procedure is its own transaction, rather than an operation within a transaction, as is the case in VoltDB. Existing large systems may not be able to operate within that constraint, although new systems designed with that in mind might not have any problem with this.

Going “a lot faster” requires the whole database to be in main memory, as it is with VoltDB (the points on these slides do not apply to RDBMS’s other than VoltDB). The reason VoltDB can get rid of buffer management is that there are no (disk) buffers. VoltDB need not do lock management because there is no concurrency control: every transaction simply runs to completion, since there are no I/O waits and therefore no reason to interleave transactions.
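The idea of running every transaction serially to completion is easy to sketch. The following is my own toy model, not VoltDB’s actual architecture: all the data lives in memory, a single worker executes one transaction function at a time, and because nothing ever interleaves with it, no locks are needed.

    import queue
    import threading

    class SerialEngine:
        """Toy in-memory engine: one worker runs each transaction to completion."""

        def __init__(self):
            self.data = {}                 # the whole "database" lives in memory
            self.work = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def _run(self):
            while True:
                txn, done = self.work.get()
                done.put(txn(self.data))   # runs to completion; nothing interleaves with it

        def execute(self, txn):
            done = queue.Queue()
            self.work.put((txn, done))
            return done.get()              # the caller waits for the result

    engine = SerialEngine()
    engine.execute(lambda db: db.update({"balance": 100}))
    print(engine.execute(lambda db: db["balance"]))   # 100, with no locking anywhere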

This is great if it works for your application. In point #5, he says that most OLTP databases are not very big, e.g. < 1TB, and for a database that size, using main memory is quite feasible these days. The required sizes of OLTP databases will probably rise with time. Of course, computers and memory are also getting faster and larger for the same price.

#3: SQL Systems don’t scale

If you have ever been involved in benchmarking, you know how difficult it is to interpret benchmark results. Is it possible that these results were obtained by choosing a benchmark that is particularly favorable to VoltDB? The only benchmark that really matters is your own application: they are all different. Of course, the problem with that is that it’s hard to port your application merely to test performance. But ignoring that and looking at other benchmarks is like looking for a lost key under the streetlight because it’s easier to look there. I’m not saying that these numbers are misleading, and certainly not that they are intentionally misleading, but they are very hard to interpret without knowing exactly what was benchmarked, how everything was tuned, and so on. I say this from my own experience, having done benchmarking of database systems for years.

(Also note that by TPC-C, he does not mean the officially defined TPC-C benchmark; look it up and you’ll see that running it is a huge, major project. He means a very simplified example based on the key concepts of TPC-C. (You can see this in the academic papers by him and others.) That said, if you do want a micro-benchmark that is as close as possible to what people agree to be a good measure of online transaction performance, this might be the best one can do.)

#5: ACID is too slow

ACID is great for software developers, providing them a very clean and easy-to-understand model. Ease of understanding is crucial for achieving simplicity, which is the Holy Grail of software development, enhancing maintainability and correctness. I’m all for ACID.

To clarify something often not explained well: the NoSQL stores are ACID. It’s just that what they can do within one ACID transaction is usually quite limited. For example, a transaction might only be able to fetch a value (or store a value, or increment a value) given the key, and then the transaction is over. That operation is ACID.
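For example, a key-value store might offer an atomic increment: the read-modify-write on that one key is atomic and isolated, but that single operation is the entire transaction. Here’s a minimal in-memory sketch of the idea (hypothetical code, not any particular product’s API):

    import threading

    class KeyValueStore:
        """Each operation on a single key is atomic; there are no multi-key transactions."""

        def __init__(self):
            self._data = {}
            self._lock = threading.Lock()

        def get(self, key):
            with self._lock:
                return self._data.get(key)

        def put(self, key, value):
            with self._lock:
                self._data[key] = value

        def increment(self, key, amount=1):
            # One atomic read-modify-write; this one operation is the whole "transaction".
            with self._lock:
                self._data[key] = self._data.get(key, 0) + amount
                return self._data[key]

    store = KeyValueStore()
    store.put("page_views", 0)
    print(store.increment("page_views"))   # 1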

In a classic RDBMS, you can do many operations within one transaction. Your program says “begin transaction” (sometimes this is tacit), and then you can do computations that include both code and database queries/updates, interleaved. At the end you say “commit transaction”. (During or at the end of a transaction, the DBMS might have to abort the transaction.)
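Here’s what that classic shape looks like, sketched with Python’s built-in sqlite3 module (the table and the business rule are invented just to illustrate the pattern of interleaving application code with database operations inside one transaction):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 20)")
    conn.commit()

    try:
        # "begin transaction" is tacit: sqlite3 opens one implicitly before the updates below.
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()

        amount = 30                      # ordinary application code, interleaved
        if balance < amount:             # with the database operations
            raise ValueError("insufficient funds")

        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
        conn.commit()                    # "commit transaction"
    except Exception:
        conn.rollback()                  # abort the transaction instead
        raise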

Right now, as they are actually used in practice, very few DBMS’s provide true ACID properties, for two reasons. First, they run at reduced “isolation levels”, which means that the “I” in ACID is compromised. See my blog article for an explanation of this.

Second, one often wants to provide a way to recover from the failure of an entire data center. This is done by having a second data center that is far enough away that it won’t be damaged by the failure of the primary data center. This means you can keep going in the face of a “disaster” such as a regional power outage, a tsunami, etc.

The problem is that if the data center is far enough away to have truly independent failure modes, then the network connection will have such high latency that it is not feasible for every transaction to synchronously update the distant copy as part of its commit. Most often, commit results are sent asynchronously to the distant copy. If the local data center fails, any transactions that had been committed, but had not yet reached the distant copy, are lost. So those transactions were not durable, the “D” in ACID. So there is a tradeoff here. (People live with this by being willing to do manual fixups after a disaster.)
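Here’s a sketch of why the trade-off exists (the names and the scenario are made up): the commit returns as soon as the record is durable locally, a background sender ships records to the distant copy, and anything still waiting to be shipped when the primary site is destroyed is simply lost.

    import queue

    local_log = []              # durable at the primary data center
    remote_log = []             # the distant copy
    pending = queue.Queue()     # committed records not yet shipped

    def commit(record):
        local_log.append(record)   # the transaction is "committed" once this is durable locally
        pending.put(record)        # shipped to the distant copy later, asynchronously

    def replicate_once():
        # Runs in the background; the WAN round-trip is too slow to wait for synchronously.
        while not pending.empty():
            remote_log.append(pending.get())

    commit("txn-1")
    replicate_once()
    commit("txn-2")             # the primary fails before this record is shipped...
    print(remote_log)           # ['txn-1']: txn-2 was committed locally but is not durable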

As discussed above, VoltDB transactions do not allow you to interleave code in your application with database operations. (The stored procedures can run arbitrary Java code, but that’s not the same as what I described above.)

#6: In CAP, choose AP over CA

I disagree that network partitions are not a major concern. Very simple local-area networks do not suffer from partitions and network failures much, but even a medium-size network is vulnerable, and networks in large data centers are quite vulnerable, as you can easily learn from network operations experts. For example, routers fail, or are misconfigured.

Both Amazon and Google have published papers about their large-scale data stores. The papers talk a lot about how they deal with network partitions. If partitions were so unlikely, why are these large companies taking the problem so seriously, and using rather sophisticated techniques to deal with them? Also, how to deal with network partitions has been a hot topic of research for the last 35 years; again, why would that be true if partitions were not an important concern?

So, as your network becomes larger and more complex, dealing with partitions becomes more and more of an issue. My impression (I may be wrong) is that the “sweet spot” for VoltDB, at least at the moment, is for distributed systems that are not at the kind of very-large scale of an Amazon or Google, and indeed for a much smaller scale, which makes network partitions much less of a problem. There’s nothing wrong with this at all; I’m just trying to clarify the issue and explain the reason for the controversy about this point.

Final Note

There has been an exciting explosion of innovative database technology in the last few years. Many different kinds of applications have different requirements. It’s great news for all of us that there are so many solutions at different points in the requirement space.

What are “Human-Generated Data” and “In-RAM Databases”?

For thoughtful commentary on all kinds of database and data storage systems, one of the best sources is Curt Monash’s DBMS2 blog.  Recently he posted an article called Traditional Databases will eventually wind up in RAM.  I have two comments about his points from that article.

Human-Generated Data

I’m still not totally comfortable with Curt’s distinction between “human-generated” and “machine-generated” data.  Data from humans always goes through machines, so at some level all data is machine-generated. I think what he’s saying is that the number of humans is roughly constant (on the time scale he means), and they only have so much time in a day to key in data, etc.  But what about trends that create more bits from any particular bit of human activity?
In the old days, records in databases were created when a person “keyed in” some fields.  Now, data is generated every time you click on something.  As data systems increase in capacity, won’t computers start gathering more and more data for each human interaction?  For example, every time I click, the system records what I clicked on, plus such context as the entire contents of what was on my browser screen, how long it has been since my last click, the times of each of my previous 1,000 clicks, everything it currently knows (this keeps changing) about my buying habits, etc.
That may be far-fetched, but I’m not so sure: betting on things staying the size they are now has usually turned out to be less than prescient.  In any case, the underlying principle is analogous to the “Freeway Effect”: just as new freeway lanes promptly fill with traffic, higher data rates and bigger databases will never be “enough”.

We’ll find more data to transmit and more to store, forever and ever.


In-RAM Database Systems

Having a database “in RAM” can mean more than one thing.

In traditional DBMS design, data “in RAM” is vulnerable to a very common failure mode, namely, the machine crashing.  So no database data is considered to be durable (the “D” in “ACID”) until it has been written to disk, which is less vulnerable, especially if you use RAID, etc.  So traditionally writes are sent to a log and forced to disk.  You can still keep the data itself in RAM, but recovery from the log will take longer and longer as the log grows in size, so you “checkpoint” the data by writing it out to disk.  That can be done in the background if everything is designed properly.  This is utterly standard.
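Here’s a minimal sketch of that standard design (the file names and record format are invented): every update is appended to the log and forced to disk before the in-RAM copy is considered durable, and a checkpoint writes the whole in-RAM state out so the log can be truncated.

    import json
    import os

    class InRamTable:
        """Toy store: data lives in RAM, durability comes from a write-ahead log."""

        def __init__(self, log_path="wal.log", checkpoint_path="checkpoint.json"):
            self.data = {}
            self.log_path = log_path
            self.checkpoint_path = checkpoint_path
            self._recover()
            self.log = open(self.log_path, "a")

        def put(self, key, value):
            self.log.write(json.dumps({"key": key, "value": value}) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())   # forced to disk before we call the write durable
            self.data[key] = value        # queries are served from the in-RAM copy

        def checkpoint(self):
            with open(self.checkpoint_path, "w") as f:
                json.dump(self.data, f)   # write the whole in-RAM state out
                f.flush()
                os.fsync(f.fileno())
            self.log.close()
            self.log = open(self.log_path, "w")   # the log can now be truncated

        def _recover(self):
            if os.path.exists(self.checkpoint_path):
                with open(self.checkpoint_path) as f:
                    self.data = json.load(f)
            if os.path.exists(self.log_path):
                with open(self.log_path) as f:
                    for line in f:        # replay whatever was logged since the checkpoint
                        record = json.loads(line)
                        self.data[record["key"]] = record["value"]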

It’s also traditional that there isn’t enough RAM to hold the whole database, so RAM is used as a cache.  This creates some issues when you have to write a modified page back to disk NOT as part of a checkpoint, and there are very standard ways to deal with that.

“In RAM” can mean (a) as above, but usually/always the RAM cache is so big that you never overflow the cache; (b) the database system is designed so that data must fit in RAM, which can simplify buffer management and recovery algorithms; (c) you get around the machine-crash problem some way or other and really do keep everything only in RAM.

One way to do (c) is to keep all data in (at least) two copies, such that they’ll never both be down.  This requires that the machines (1) have very, very independent failure modes, which is not as easy to arrange as one might think, and (2) get fixed very quickly, since while one is down you have fewer copies.  Issue (2) is one reason to keep more than two copies; usually three copies are recommended, with one being at a “distant” data center.

This approach can be used for the log even if not for the whole DBMS.  HDFS, the Hadoop Distributed File System, and VoltDB consider this the preferred/canonical way to go.  In both cases, some users still feel uncomfortable with approach (c), and so both have put in ways to commit the log to a conventional disk.  The hope is that as approach (c) proves itself in real production environments over the years, it will be more and more accepted.