Archive for the ‘Database’ Category

What Does the Proof of the “CAP theorem” Mean?

Monday, July 12th, 2010

Several years back, Eric Brewer of U.C. Berkeley presented the “CAP conjecture”, which he explained in these slides from his keynote speech at the PODC conference in 2004. The conjecture says that a system cannot be consistent, available, and partition-tolerant; that is, it can have two of these properties, but not all three. This idea has been very influential.

Seth Gilbert and Nancy Lynch, of MIT, in 2002, wrote a now-famous paper called “Brewer’s Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services”. It is widely said that this paper proves the conjecture, which is now considered a theorem. Gilbert and Lynch, clearly proved something, but what does the proof mean by “consistency”, “availability”, and “partition-tolerance”?

Many people refer to the proof, but not all of them have actually read the paper, thinking that it’s all obvious. I wasn’t so sure, and wanted to get to the bottom of it. There’s something about my personality that drives me to look at things all the way down to the details before I feel I understand. (This is not always a good thing: I sometimes lose track of what I originally intended to do, as I “dive down a rat-hole”, wasting time.) For at least a year, I have wanted to really figure this out.

A week ago, I came across a blog entry called “Availability and Partition Tolerance” by Jeff Darcy. You can’t imagine how happy I was to find someone who agrees that there is confusion about the terms, and that they need to be clarified. Reading Jeff’s post inspired me to finally read Gilbert and Lynch’s paper carefully and write these comments.

I had an extensive email conversation with Jeff, without whose help I could not have written this. I am very grateful for his generous assistance. I also thank Seth Gilbert for helping to clarify his paper for me. I am solely responsible for all mistakes.

I will now explain it all for you. First I’ll lay out the basic concepts and terminology. Then I’ll discuss what “C”, “A”, and “P” mean, and the “CAP theorem”. Next I’ll discuss “weak consistency”, and summarize the meaning of the proof for practical purposes.

Basic Concepts

A distributed system is built of “nodes” (computers), which can (attempt to) send messages to each other over a network. But the network is not entirely reliable. There is no bound on how long a message might take to arrive. This implies that it might “get lost”, which is effectively the same as taking an extremely long time to arrive. If a node sends a message (and does not see an acknowledgment), it has no way to know whether the message was received and processed or not (because either the request or the response might have been lost).

There are “objects”, which are abstract resources that reside on nodes. Objects can perform “operations” on those objects. Operations are synchronous: some thread issues a request and expects a response. Operations do not request other operations, so they do not do any messaging themselves.

(There can be replicas of an object on more than one node, but for the most part that doesn’t affect the following discussion. An operation could be “read X and return the value”, “write X”, “add X to the beginning of a queue”, etc. I’ll just say “read” for an operation that has no side-effects and returns some part of the state of the object, and “write” to mean an operation that performs side-effects.)

A “client” is a thread running on some node, which can “request” an object (on any node) to perform an operation. The request is sent in a message, and the sender expects a response message, which might returns a value, and which confirms that the operation was performed. In general, more than one thread could be performing operations on one object. That is, there can be concurrent requests.

The paper says: “In this note we will not consider stopping failures, though in some cases a stopping failure can be modeled as a node existing in its own unique component of a partition.” Of course in any real distributed system, nodes can crash. But for purposes of this paper, a crash is considered to be a network failure, because from the point of view of another node, there’s no way to distinguish between the two. A crashed node behaves exactly like a node that’s off the network.

Consistent

The paper says that consistency “is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.”

Paraphrased, this means consistency is equivalent to requiring all operations (in the whole distributed system) to be linearizable.

“Linearizability” is a formal criterion presented in the paper “Linearizability: A Correctness Condition for Concurrent Objects”, by Maurice Herlihy and Jeannette Wing. It means (basically) that operations behave as if there were no concurrency.

That is, although there is a set of threads, each of which can send an operation to an object, and later receive a response. Despite the fact that the operations from the different threads can overlap in time in various ways, the responses are as if each operation took place instantaneously, in some order. The order must be consistent with each thread’s own order, so that a read operation in a thread always sees the results of that thread’s writes.

Linearizability per se does not include failure atomicity, which is the “A” (“atomic”) in “ACID”. But Gilbert and Lynch assume no node failures. So operations always run to completion, and so they are “atomic” as well. (But their response messages can get lost.)

So by “consistent” (“C”), the paper means that every object is linearizable. (That’s not what the “C” in “ACID” means, by the way, but that’s not important.) Very loosely, “consistent” means that if you get a response, it has the right answer.

This is not what the “C” in “ACID transaction” means. It’s what the “I” means, namely “isolation” from concurrent operations. This is probably a source of confusion sometimes.

Furthermore, the paper says nothing about transactions, which have would have a beginning, a sequence of operations, and an end, which may commit or abort. “ACID” is talking about the entire transaction. The “linearizability” criterion only talks about individual operations on objects. (So the whole “ACID versus BASE” business, while cute, can be misleading.)

Available

“Available” is defined as “every request received by a non-failing node in the system must result in a response.” The phrase “non-failing node” seemed to imply that some nodes might be failing and others not. But since the paper postulates that nodes never fail, I believe the phrase is redundant, and can be ignored.

After the definition, the paper says “That is, any algorithm used by the service must eventually terminate.” Does that mean the response might not arrive for a trillion years? That’s the same as never terminating, from a practical point of view!

This definition of “available” is only useful if it includes some kind of real-time limit: the response must arrive within a period of time, which I’ll call the maximum latency.

Next, it’s very important to notice that “A” says nothing about the content of the response. It could be anything, as far as “A” is concerned; it need not be “successful” or “correct”. (If think otherwise, see section 3.2.3.)

So “available” (“A”) means: If a client sends a request to a node, it always gets back some response within L time, but there is no guarantee about contents of the response.

Partition Tolerant

There is no definition, per se, of the term “partition-tolerant”, not even in section 2.3, “Partition Tolerance”.

First, what is a “partition”? They first define it to mean that there is a way to assort all the nodes into separate sets, which they call “components”, and all messages sent from a node in one component to another nodes in a separate component are lost. But then they go on to say “And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instance the message is lost.” That’s probably not what you had in mind!

In real life, some messages are lost and some aren’t, and it’s not exactly clear when a “partition” situation starts, is happening, or ends. I realize that for practical purposes, we usually know what a partition means, but if we’re going to do formal proofs and understand what was proved, one must be completely clear about these terms.

Even in a local-area network, packets can be dropped. Protocols like TCP re-transmit packets until the destination acknowledges that they have arrived. If that happens, it’s clearly not a network failure from the point of view of the application. “Losing messages” clearly has something to do with nodes entirely unable to communicate for a “long” time compared to the latency requirements of the system.

Taking that into account: for their formal purposes, “partition” simply means that a message can be lost. (The whole “component” business can be disregarded.)

Furthermore, remember that node failure is treated as a network failure.

So “partition-tolerant” (“P”) means that any guarantee of atomicity or availability is still guaranteed even if there is a partition. In other words, if a system is not partition-tolerant, that means that if the network can lose messages or any nodes can fail, then any guarantee of atomicity or consistency is voided.

CAP

The CAP theorem says that a distributed system as described above cannot have properties C, A, and P all at the same time. You can only have two of them. There are three cases:

AP: You are guaranteed get back responses promptly, but you aren’t guaranteed anything about the value/contents of the response. (See section 3.2.3.) This guarantee is completely useless.

CP: You are guaranteed that any response you get has a consistent (linearizable) result. But you might not get any responses whatsoever. (See section 3.2.1.) This guarantee is also completely useless, since the entire system might behave as if it were totally down.

CA: If the network never fails (and nodes never crash, as they postulated earlier), then, unsurprisingly, life is good. But if any messages are dropped, all guarantees are off. So a CA guarantee is also completely useless. (Note that even if only one node is unreachable, you still can’t guarantee CA, since any request might need a resource on that one node.)

At first, this seems to mean that practical, large distributed systems can’t make any useful guarantees! What’s going on here?

Weak Consistency

Large-scale distributed systems that must be highly available can provide some kind of “weaker” consistency guarantee than linearizability. Most such systems provide what they call “eventual consistency” and may return “stale data”.

For some applications, that’s OK. Google search is an obvious case: the search is already specified/known to be using “stale” data (data since the last time Google looked at the web page), so as long as partitions are fixed quickly relative to the speed of Google’s updating everything, (and even if sometimes not, for that matter), nobody is going to complain.

Just saying that results “might be stale” and will be “eventually consistent” is unfortunately vague. How stale can it be, and how long is “eventually”? If there’s no limit, then there’s no useful guarantee.

For a staleness-type weak consistency guarantee, you’d like to be able to say something like: “operations (that read) will always return a result that was consistent with all the other operations (that write) no longer ago than time X”. And this implies that “write” operations are never lost, i.e. always happen within a fixed time bound.

Gilbert and Lynch discuss “weakened consistency” in section 4. It’s also about stale data, but with “formal requirements on the quality of stale data returned”. They call it “t-Connected Consistency”.

It makes two assumptions. (a) Every node has a clock that can be used to do timeouts. The clocks don’t have to be synchronous with each other. (b) There’s some time period after which you can assume that an unanswered message must be lost. (c) Every node processes a received message within a given, known time.

The real definition of “t-Connected Consistency” is too formal for me to explain here (see section 4.4). It (basically) guarantees (1) when there is no partition, the system is fully consistent; (2) if a partition happens, requests can see stale data; and (3) and after the partition is fixed, there’s a time limit on how long it takes for consistency to return.

Are the assumptions OK in practice? Every real computer can do timeouts, so (a) is no problem. You can always ignore any responses to messages after the time period, so (b) is OK. It’s not obvious that every system will obey (c), but some will.

I have two reservations. First, if the network is so big that it’s never entirely working at any one time, what would guarantee (3) mean? Second, in the algorithm in section 4.4, in the second step (“write at node A”), it retries as long as necessary to get a response. But that could exceed L, violating the availability guarantee.

So it’s not clear how attractive t-Connected Consistency really is. It can be hard it is to come up with formal proofs of more complicated, weakened consistency guarantees. Most working software engineers don’t think much about formal proofs, but don’t underrate them. Sometimes they can help you identifying bugs that would otherwise be hard to track down, before they happen.

Jeff Darcy wrote a blog posting about “eventual consistency” about a half year ago, which I recommend. And there are other kinds of weak consistency guarantees, such as the one provided by Amazon’s Dynamo key-store, which worth examining.

Reliable Networks

Can’t you just make the network reliable, so that messages are never lost? (“Never” meaning that the probability of losing a message is as low as other failure mode that you’re not protecting against.)

Lots and lots of experience has shown that in a network with lots of routers and such, no matter how much redundancy you add, you will experience lost messages, and you will see partitions that last for a significant amount of time. I don’t have a citation to prove this, but, ask around and that’s what experienced operators of distributed systems will always tell you.

How many routers is “lots”? How reliable is it if you have no routers (layer 3 switches), only hubs (layer 2 switches)? What if you don’t even have hubs? I don’t have answers to all this. But if you’re going to build a distributed system that depends on a reliable network, you had better ask experienced people about these questions. If it involves thousands of nodes and/or is geographically distributed, you can be sure that the network will have failures.

And again, as far as the proof of the CAP theorem is concerned, node failure is treated as a network failure. Having a perfect network does you no good if machines can crash, so you’d also need each node to be highly-available in and of itself. That would cost a lot more than using “commercial off-the-shelf” computers.

The Bottom Line

My conclusion is that the proof of the CAP theorem means, in practice: if you want to build a distributed system that is (1) large enough that nodes can fail and the network can’t be guaranteed to never lose messages, and (2) you want to get a useful response to every request within a specified maximum latency, then the best you can guarantee about the meaning of the response is that it is guaranteed to have some kind of “weak consistency”, which you had better carefully define in such a way that it’s useful.

P.S.

After writing this but just before posting it, Alex Feinberg added a comment to my previous blog post with a link to this excellent post by Henry Robinson, which discusses many of the same issues and links to even more posts. If you want to read more about all this, take a look.

VoltDB versus NoSQL

Sunday, July 11th, 2010

Mike Stonebraker is the co-founder and CTO of VoltDB, which makes a novel on-line transaction processing (OLTP) relational database management system (RDBMS). He recently gave a talk entitled “VoltDB Decapitates Six SQL Urban Myths”. You can read the slides here. Much of the talk is a reply to the claims of the community building data stores often referred to as NoSQL data stores.

Todd Hoff of HighScalability has written an excellent commentary on the talk. If you want to understand what’s going on with VoltDB, you can’t do better than to read this (including the commentary, with some replies from VoltDB). I have a bit to add.

Benchmarking

Dr. Stonebraker’s talk includes benchmark results, which VoltDB ran much faster than MySQL and , a well-known NoSQL data store.

Over many years, I have found that what nearly everybody wants is a predictive “single number” that says how much faster one DBMS is than another. But applications differ hugely in their workloads, and measured speed depends tremendously on using the DBMS in the best way, including layout, clustering, indexing, partitioning, and all kinds of options, such as whether transactions are immediately made durable or not. Saying that one DBMS is “N times faster” than another DBMS is very misleading. But everyone wants the magic number, and are too quick to assume that the result of one benchmark predicts speed in all situations.

One must take into account that the VoltDB engineers wrote these micro-benchmarks, and ran them on a very specific workload, knowing what they were trying to prove. I do appreciate that they made a good-faith attempt to be fair, based on John Hugg’s comments above. And I can vouch that John is a very smart guy, and I believe all that he says in his comments above. Nevertheless, they did not bring in experts in the other systems who would could tune them optimally. Different benchmarks might be less flattering.

The old argument about assembly language versus high-level languages would be analogous if RDBMS optimizers worked as well as C/Java/etc compilers. SQL is supposed to be declarative: you just ask for what you want, and the RDBMS figures out the best way to get it. But my experience, and what my friends tell me, is that the optimizers in some popular RDBMS’s (especially Oracle) frequently make bad choices, and picking the wrong query strategy can slow things down by huge factors. So the developers are forced to override the optimizer with “hints”. It’s been over 30 years, and still the optimizers fail. Maybe it’s time to declare the experiment a failure. (This may not be an issue for VoltDB, as the SQL might be always be very simple or something.)

Stored Procedures

He’s right that performance can be hurt by too many round trips to the DBMS. But Oracle users have know for a long time that you have to use stored procedures to get high performance; this is nothing new. When you do this with Oracle, you end up with lots of PL/SQL code. Most of your developers can’t understand it, and it’s a proprietary language so you’re “locked in” to Oracle (it’s very hard to switch to a different DBMS).

It’s one thing to provide stored procedures as a way to improve performance. But VoltDB requires you to use stored procedures, and each interaction with VoltDB is a transaction. Any application that mixes database access with other operations that must be done on the client side cannot use VoltDB. The application has to be written in the VoltDB manner, from the beginning. This is like “lock-in” in some ways.

More about the VoltDB presentation

Todd says: “In contrast, the VoltDB model seems from a different age: a small number of highly tended nodes in a centralized location.” I don’t think this is right. For disaster recovery (e.g. blackouts), you need a replica far away; this has always been an integral part of VoltDB’s justification for not logging to disk. And then you have to worry about network partitions over a WAN. WAN’s are not yet supported in VoltDB.

I find Todd’s point about Amazon’s Dynamo very compelling: why would Amazon do so much work if partitions are so rare? At Amazon scale, partitions must be frequent enough to justify all this work. Not all VoltDB customers will be operating at that scale, but John Hugg has said that it’s designed for “Internet scale”. Dr. Stonebraker is right that there’s no substitute for actual measurement of how likely partition is.

Putting the burden on application programmers

Serious production databases are usually manged by database experts/administrators, who decide where to replicate what, whether and how to partition tables (across nodes), and so on.

But with VoltDB, the application developers have to understand a lot about this. For example, they need to know whether a procedure is single-partitioned, so they can assert that in the code. So they have to know about sharding, where replicas are, and so on. It makes the application brittle insofar as changes by the database administrators could break those assertions.

For example, a VoltDB engineer explained to a customer: “The goal of VoltDB is to optimize single-partition transactions and part of the responsibility for that falls on the application developer. You must write the queries to operate properly within a single partition and then declare the procedure to be single-partitioned. [...] Today, VoltDB does not verify that the SQL queries within a single-partitioned procedure are actually single-partitioned.” Another VoltDB engineer said: “The vast majority (almost 100%) of your stored procedure invocations must be single-partition for VoltDB to be useful to you.”

Different “NoSQL” systems also put such burdens on application programmers to greater or lesser degrees, as well. RDBMS’s have traditionally boasted that they hide these issues from application programmers. VoltDB uses SQL, but what it provides is very different from the original concept of the relational model.

What is a “SQL” database system?

You can see more of this in their “VoltDB do’s and don’ts list” Perhaps the most important point is the first “Don’t”: “Don’t use ad hoc SQL queries as part of a production application.” Dr. Stonebraker’s talk is very much a defense of using SQL for OLTP, rather than the “NoSQL” models such as key-value stores. But what does the restriction against “ad hoc” queries mean?

The original fundamental claim of relational DBMS’s (as opposed to the previous generation, the CODASYL-type DBMS’s) is that you don’t have know the access pattern; you just say what you want in SQL, and the DBMS figures out how to do it. Applications keep working even if there are changes in the storage layout, indexing, and whatever else the DBMS uses. But, as a VoltDB engineer said, “Part of VoltDB’s underlying premise is that workloads are known in advance.”

Even though VoltDB uses SQL, maybe it isn’t as far from the “NoSQL” storage engines as one might think!

Come to New England Database Day!

Wednesday, January 28th, 2009

New England Database Day is a one-day mini-conference where participants from the research community in the New England area can come together to present ideas and discuss their research.  I highly recommend this event if you’re interested in cutting-edge database technology.  There will be eight talks plus poster sessions.

(I’ll say “database” to mean “database management system”, as is often done for brevity.)

Last year’s conference (the first one) was great. Here’s my very belated report.

David DeWitt’s paper on Clustera, which controls and runs large batch operations on a big cluster of machines.  There are three prominent classes of these, exemplified by (a) Condor, for running things like circuit simulations and weather models; (b) Gamma, for doing parallel database queries; and (c) Map/Reduce.  Clustera is designed to be able to do all three of these things reasonably well.  In fact, these three are just particular points in a whole space of possibilities, which Clustera can be used on.  Also, Clustera is simpler and smaller than other such systems, because it builds on a J2EE application server and a small relational database.  Prof. DeWitt has been in charge of the amazingly productive database system research at University of Wisconsin, Madison, although now he’s going to a new Microsoft research center to be created in Madison.

George Miklau explained about policy decisions regarding archival storage, such as privacy, accountability, retention policies, subpoenas, and redaction.  He talked about how technological decisions affect these things, too.

Stavros Harizopoulis of HP Labs described an experiment that demonstrates why main memory databases can be so fast, analyzing the costs of various modules that can be omitted such as logging (most kinds), locking, latching, buffer management, and other overhead.  No one of these takes the lion’s share of the time, it turns out.  You have to do all of them to get the best performance improvements.  A major point is that a database system designed to be column-oriented can be a lot faster than a general-purpose database acting as if it were column-oriented.

Ryan Johnson of CMU talked about many issues involved in executing queries in parallel on multi-core processors.  As you’d expect, this is a hot area, since the multi-core processors are becoming so widespread, and the number of cores is going up.  He examined work sharing, pipelining, working set size, and of course caching issues.  He presented experimental results as well.

Daniel Abadi of Yale (formerly of MIT) (not to be confused with Daniel Abadi of Microsoft Research) gave a talk called “How To Create a New Column-Store Database in a Week”.  The point was that you can do it, based on a regular row-store database, but he expains why this won’t work well.  A good column-store database must be built that way from the start.

Anastasia (Natasha) Ailamaki of Ecole Polytechnique Federale de Lausanne was honored by being the last speaker; she has won many awards and is a rising star in the database community.  Her talk was “Multi-Core: Friend or Foe?”  She explained a lot about how the memory/caching systems multi-core processors work.  She also explained some of the major design tradeoffs that the hardware designers can make: fewer, more complex cores, or the opposite, and whether hardware threads are used.  Then she talked about how all this particularly affects database systems.

The event will be in Cambridge, Mass. at MIT, in the Stata Center, room 32-123 (the big lecture hall on the first floor).  It’s be this Friday (January 30, 2009) from 9 am to 6 pm.  It’s free, but they’d like you to register so they’ll know how many people are coming.  I hope to see you there!

Perst, An Embedded Object-Oriented Database Management System

Sunday, June 8th, 2008

I just learned of Perst, which is described as an open-source embedded object-oriented Java database (meaning database management system, of course) which claims ease of working with Java objects, “exceptional transparent persistence”, and suitability for aspect-oriented programming with tools such as AspectJ and JAssist. It’s available under a dual license, with the free non-commercial version under the GPL. (There is a .NET version aimed at C# as well, but I’ll stick to what I know; presumably the important aspects are closely analogous.)

I have some familiarity with PSE Pro for Java, from what is now the ObjectStore division of Progress. Its official name is now Progress ObjectStore PSE Pro. I’ll refer to it as PP4J, for brevity. I was one of its original implementors, but it was substantially enhanced after I left. I also have a bit of familiarity with Berkeley Database Java Edition, a more recently developed embedded DBMS, which I’ll refer to as BDBJE.

PP4J provides excellent transparency, in the sense that you “just program in Java and everything works.” It does this by using a class-file postprocessor. However, Perst claims to provide the same benefit without such a preprocessor. It also claims to be “very fast, as much as four times faster than alternative commercial Java OODBMS.” Of course, “as much as” usually means that they found one such micro-benchmark; still, it would be uninteresting had they not even claimed good performance. And it has ACID transactions with “very fast” recovery.

Those are very impressive claims. In the rest of this post, I’ll examine them.

Who Created Perst?

I always like to glance at the background of the company. In particular, I like to know who the lead technical people are and where they worked before. Unfortunately, the company’s management page only lists the CEO, COO, and Director of Marketing, which is rather unusual. They’re in Issaquah, WA; could the technical people be ex-Microsoft? It’s important to note that McObject’s main product line, called eXtremeDB(tm), is technically unrelated to Perst.

But I found a clue. The Java package names start with org.garret. It’s usually hard to retroactively change Java package names, so they’re often leftovers from an earlier naming scheme. By doing some searches on “garret”, I found Konstantin Khizhnik, a 36-year old software engineer from Moscow with 16 or so years experience, who has written and is distributing an object-oriented database system called “GOODS” (the “G” stands for “Generic”). His most recent release was March 2, 2007. He has a table that compares the features of GOODS with those of other systems, including Perst. At the bottom it says: “I have three products GigaBASE, PERST and DyBASE build on the same core.” He also has an essay named Overview of Embedded Object Oriented Databases for Java and C# which includes an extensive section on the architecture of Perst. This page also has some micro-benchmark comparisons including Perst, PP4J, BDBJE, and db40, but not GOODS. Perst comes out looking very good.

He even has a table of different releases of several DBMS’s, including GOODS and Perst, saying what changes were made in each minor release! But at no point does he say that he was involved in creating the Perst technology.

He mentions the web site perst.org. There’s nothing relevant there now, but Brewster’s Wayback machine shows that there used to be, starting in October, 2004. It’s quite clearly the same Perst. And the “Back to my home page” link is to Knizhnik’s home page. Aha, the smoking gun! By December, 2005, the site now mentions the dual license, and directs you to McObject LLC for a non-GPL commercial license. In 2006, the site changes to the McObject web site. McObject has several other embedded database products and was founded in 2001. This strongly suggests that McObject bought Perst from Knizhnik in 2006.

I joined the Yahoo “oodbms” group, and there’s Knizhnik, who is apparently maintaining FastDB, GigaBASE, and GOODS. He also wrote Dybase, based on the same kernel as GigaBASE. He announced Perst Lite in October, 2006. Postings on this group are sparse, mainly consisting of announcements of new minor releases of those three DBMS’s

The Tutorial
The documentation starts with a tutorial. Here are the high points, with my own comments in square brackets. My comparisons are mainly with PP4J, which likewise provides transparent Java objects. BDBJE works at a lower level of abstraction. [Update: BDBJE now has a transparent Java object feature, called DPL.] I don’t know enough about the low-level organization of BDBJE or of the current PP4J to make well-informed qualitative comparisons.

Perst claims to efficiently manages much more data than can fit in main memory. It has slightly different libraries for Java 1.1, 1.4, and 1.5, and J2ME. There is a base class called Persistent that you have to use for all persistent classes. [This is a drawback, due to Java's lack of multiple inheritance of implementation. PP4J does not have this restriction.] They explain a workaround in which you can copy some of the code of their Persistent.java class. [That sounds somewhat questionable from a modularity point of view, and doesn't help you for third-party libraries unless you want to mess with their sources.]

Files that hold databases can be stored compressed, encrypted, or as several physical files, in no file at all for in-memory use, and there’s an interface allowing you to add your own low level representation. Each database has a root object in the usual way. They use persistence-by-reachability [like PP4J]. There is a garbage collector for persistent objects. However, there is also explicit deletion; they correctly point out that this can lead to dangling pointers. [The fact that they have it at all suggests that the garbage collector is not always good enough.]

There are six ways to model relationships between objects. To their credit, they have a “(!)” after the word “six”. You can use arrays, but they explain the drawbacks to this. The Relation class is like a persistent ArrayList. The Link class is like Relation but it’s embedded instead of being its own object [huh?]. The IPersistentList interface has implementing classes that store a collection as a B-tree, which is good for large collections but has high overhead for small ones. Similarly, there is IPersistentSet. And finally there is a collection that automatically mutates from Link to a B-tree as the size gets larger. [PP4J, I believe, offers equivalents of the array, the list, and the set, and the list and set do the automatic mutation.]

How can they do transparent loading of objects, i.e. following pointers? They give you two choices, which can be set as a mode for each object: load everything reachable from the object, or make the programmer explicitly call the “load” method. They claim that this is usually sufficient, since your Java data structures usually consist of clusters of Java objects that are reachable from one head object, with no references between such clusters.

They assume that you always want to read in the whole cluster when you touch the head object [often true, but not always]. Also, when you modify an object, you must explicitly call the “modify” method, unless this is one of Perst’s library classes, whose methods call “modify” on themselves when needed. They say “Failure to invoke the modifymethod can result in unpredictable application behavior.”

[This is not what I would call "transparent"! PP4J is truly transparent, in that there is neither a "load" nor a "modify". PP4J always does these automatically. The Perst tutorial does not say what happens if you forget to call "load" when you were supposed to. Not all Java data follows their cluster model. PP4J depends for its transparency on the class postprocessor. As I recall, the postprocessor runs quickly enough that it doesn't seriously impact the total compile time. The only problem I had with it, as a user, was that it doesn't fit with the model assumed by the interactive development environments such as IntelliJ, requiring some inelegance in setting up your project.]

Perst has collections with indexes implemented as B-trees and allowing exact, range, and prefix searches. The programmer must explicitly insert and delete objects from indexes; if you make a change to some object that might affect any indexes, you have to remove the object from the index and re-insert it. [So you need to know in advance which things might ever be indexed, or pessimistically assume that they all are, and so remove and re-insert whenever the object is changed in any way. [I am pretty sure that PP4J does this automatically.] You can index on fields directly, or you can create your own index values (since you’re inserting explicitly) that could be any function of the indexed object. [That's very useful, and I cannot remember whether PP4J provides this.] Keys can be compound (several fields). They provide R-tree indexes and KD-tree indexes, useful for 2-D operations such as finding a point within certain constraints. They also provide Patricia Tries, bit indexes, and more. [Wow, how do they fit all that into such a small footprint?]

Transaction recovery is done with a shadow-object mechanism and can only do one transaction at a time. (So ACID really means AD.) [Like PP4J, at least in its first version.] The interaction of transaction semantics with threads, always a sticky issue, can be done in several ways, too extensive to go into here. [This looks very good.] Locks are multiple-reader single-writer. Locking is not transparent in basic Perst [Bad!], but there’s a package called “Continuous” which does provide transparency, although it’s not described in the tutorial. [So beginning users also have to remember to do explicit locking?] Two processes can access a single database by locking the whole database at the file system level; it works to have many readers.

There is a class called “Database” that provides semantics more like a conventional DBMS. It maintaints extents of classes. [Note: that means instances of the class can never be deallocated.] It can created indexes automatically based on Java annotations, but you still must do manual updates when the indexed object changes. It uses table-level locking. It has a query lanauge called JSQL, but it’s unlike SQL in that it returns objects rather than tuples, and does not support joins, nested selects, grouping, nor aggregate functions. You can “prepare” (pre-parse) JSQL queries, to improve performance if you use them many times, just as with most relational DBMS’s. A JSQL query is like a SQL “where” clause, and it uses whatever existing indexes are appropriate.

Schema evolution is automatic, and done per-object as the object is modified. It can’t handle renaming classes and fields, moving fields to a descendant or ancestor class, changing the class hierarchy, nor changing types of fields that aren’t convertible in the Java language. There’s an export/import facility that you’d use for those changes. You can physically compact the database files. Backup and restore are just like files [you have to back up the whole thing, not just changes, which is probably true of PP4J as well.] You can export to XML.

Perst supports automatic database replication. There’s one master, where all writes are performed, and any number of slaves, where reads can be performed. This lets you load-balance reads. It’s done at page granularity. You can specify whether it’s synchronous or asynchronous. You can add new slaves dynamically. For high-availability, it’s your responsibility to detect node failure and choose a new master node. [PP4J did not have this, the last time I looked.]

Recent Press Releases from McObject

Version 3.0 has new features. There is a full-text search, much smaller in footprint than Lucene and geared specifically to persistent classes. The .NET version supports LINQ. They list CA’s Wily Technology as a successful reference customer, for a real-time Java application.

Perst is used in Frost, which is a client for Freenet. “Frost is a newsgroup reader-like client application used to share encrypted messages, files and other information over Freenet without fear of censorship.” They switched from “the previous used SQL database” [it turns out to be something called McKoi] because its recovery was unreliable (leaving corrupt databases), Perst’s schema evolution is easier to use, the databases are smaller (because Perst can store strings in UTF-8 encoding), and because they could get better performance from Perst as the database size grew.

Perst has been verified as compatible with Google’s Android. They provide a benchmark program comparing performance of Perst against Android’s bundled SQLList. It’s a simple program that makes objects with one int field and one String field, and an index on each field. It inserts, looks up, etc. [It would be easy to recode it for PP4J or for B2B4J.]

The Download

The basic jar file is 530KB. “continuous” (see above) is another 57KB.

There’s more documentation, which goes into great detail about the internal implementation of how objects are represented and how atomic commit works. [It's extremely similar to PP4J. (The original version, anyway; it might have changed since.)]

There are other features, for which I could not find documentation. For intsance, each persistent class can have a “custom” allocator that you supply. You could use this to represent very large objects (BLOB/CLOB) by putting them in a separate file. In the database, you’d store the name of this file, and perhaps an offset or whatever. Also, there is an implementation of the Resource Description Framework (RDF, used by the Semantic Web to represent metadata).

There are lots of parameters that you can set from environment variables. I was not able to find documentation for these. The one that interests me most lets you control the behavior of the object cache. The default is a weak hash table with LRU behavior. Other possibilities are a weak table without the LRU, a strong hash table (if you don’t want to limit the object cache size), and a SoftHashTable which uses a Java “soft” hash table.

The code is clearly written except that it’s extremely short on comments.

Overall Evaluation

Perst is a lot like PP4J. To my mind, the most important difference is the degree of transparency. I greatly prefer PP4J’s approach of providing complete transparency, i.e. not requiring the use of methods such as load and modify. This has two advantages. First, your code is clearer and simpler if isn’t interrupted by all those calls to load and modify. Second, without transparency, it’s far too easy to forget to call load or modify, which would cause a bug, in some cases a bug that’s hard to find. Another problem is that the reference documentation is clearly incomplete and needs work. The tutorial, though, is quite clear and professionally-written, and very honest about the tradeoffs, pros, and cons of the product design. Personally, if you want to my respect, that’s how to do it!

However, it has a bunch of features and package that PP4J doesn’t (as far as I know).

I don’t know anything about the pricing of either product.

On the whole, for what it’s aiming for, Perst appears to be a very good, and a real competitor in this space.