Archive for the ‘Uncategorized’ Category

Adventures trying to use open-source libraries

Sunday, April 5th, 2009

It seems that whenever I want to use an open source library, I run into problems because of various kinds of dependencies. I’ve run into this with Java and C++ libraries. Most recently, I had one of these adventures with a Common Lisp library.

Babel is an excellent open-source Common Lisp library for converting between string representations, such as the different encodings of UNICODE, as well as EBCDIC characters and so on. It’s portable and efficient. We use it to decode UTF-8 into full 32-bit UNICODE.

Recently we suspected that it might be running more slowly than we’d like, and that we might be able to get a measurable speedup by optimizing it. So I thought I’d write a simple benchmark and try some changes that might speed it up, such as adding fixnum declarations.

Babel includes a regression test. Obviously, I needed to make sure any speedups that I put in would not break Babel, so running the regression test would be important. This is where the fun began.

Babel’s regression test depends on a Common Lisp unit test framework called stefil (which I’d never heard of). I found stefil on the web, but there was no source distribution. The only way to get it was to use darcs.

The machine on my desktop uses an old version of Linux. (The reasons are too boring to go into here, and it’ll be upgraded soon.) It does not have darcs already installed on it. No problem, I said to myself, and proceeded to obtain darcs. It turns out that darcs comes in source form, so you have to compile it.

Darcs is written in Haskell, and my Linux machine does not already have the Haskell compiler. So I downloaded the compiler (GHC), and tried to compile it. But I got weird error messages about missing C header files. I could not figure this out, because the build mechanism for GHC is rather complicated, using tools that I would have had to figure out, etc.

Finally I gave up, and found someone with a more modern version of Linux that already had darcs. He got stefil for me.

Next, I found that stefil depends on several other Common Lisp libraries: Swank, alexandria, iterate, and metabang-bind. We already had Swank (which is part of Slime), and alexandria, so I found and downloaded iterate and metabang-bind.

I got error messages trying to compile stefil. It eventually turned out that stefil depends on a non-standard version of Swank, and will not compile with any other version. Since I did not need the feature that integrates stefil with Slime/Swank, I had to comment out the dependency on stefil in its asdf file (which is like a makefile).

Compiling stefil still failed, because it uses the iterate library, and iterate includes a Common Lisp code walker, and in the version of Clozure Common Lisp that we use at ITA, assert macroexpands to a non-portable form that the code walker does not understand. This feature in assert was added for us in order make the code coverage tool know that it’s OK that we do not cover assert forms, but, of course, iterate’s code walker didn’t know about it. (A code walker must know about every Lisp “special form”.) I fixed this by learning how the code walker is organized, and extending it to know about assert as a primitive special form.

Finally, the babel regression tests turned out to have bugs. They depend on char-code always returning a fixnum, which is a violation of the Common Lisp standard. I had to fix various things and comment out other things in order to make the unit test work properly with Clozure Common Lisp (which was not at fault).

After all this, I was able to run the regression test, and so I could proceed to make changes to Babel with some assurance that I didn’t introduce bugs. But it all took so much time that I fell behind in my work schedule, which was, to say the least, annoying.

The problem partly lies with my using such an old version of Linux, but this kind of problem seems to be common with open source libraries in all languages and domains. If they’re not used very widely, and not maintained, they often don’t work well together.

Perst, An Embedded Object-Oriented Database Management System

Sunday, June 8th, 2008

I just learned of Perst, which is described as an open-source embedded object-oriented Java database (meaning database management system, of course) which claims ease of working with Java objects, “exceptional transparent persistence”, and suitability for aspect-oriented programming with tools such as AspectJ and JAssist. It’s available under a dual license, with the free non-commercial version under the GPL. (There is a .NET version aimed at C# as well, but I’ll stick to what I know; presumably the important aspects are closely analogous.)

I have some familiarity with PSE Pro for Java, from what is now the ObjectStore division of Progress. Its official name is now Progress ObjectStore PSE Pro. I’ll refer to it as PP4J, for brevity. I was one of its original implementors, but it was substantially enhanced after I left. I also have a bit of familiarity with Berkeley Database Java Edition, a more recently developed embedded DBMS, which I’ll refer to as BDBJE.

PP4J provides excellent transparency, in the sense that you “just program in Java and everything works.” It does this by using a class-file postprocessor. However, Perst claims to provide the same benefit without such a preprocessor. It also claims to be “very fast, as much as four times faster than alternative commercial Java OODBMS.” Of course, “as much as” usually means that they found one such micro-benchmark; still, it would be uninteresting had they not even claimed good performance. And it has ACID transactions with “very fast” recovery.

Those are very impressive claims. In the rest of this post, I’ll examine them.

Who Created Perst?

I always like to glance at the background of the company. In particular, I like to know who the lead technical people are and where they worked before. Unfortunately, the company’s management page only lists the CEO, COO, and Director of Marketing, which is rather unusual. They’re in Issaquah, WA; could the technical people be ex-Microsoft? It’s important to note that McObject’s main product line, called eXtremeDB(tm), is technically unrelated to Perst.

But I found a clue. The Java package names start with org.garret. It’s usually hard to retroactively change Java package names, so they’re often leftovers from an earlier naming scheme. By doing some searches on “garret”, I found Konstantin Khizhnik, a 36-year old software engineer from Moscow with 16 or so years experience, who has written and is distributing an object-oriented database system called “GOODS” (the “G” stands for “Generic”). His most recent release was March 2, 2007. He has a table that compares the features of GOODS with those of other systems, including Perst. At the bottom it says: “I have three products GigaBASE, PERST and DyBASE build on the same core.” He also has an essay named Overview of Embedded Object Oriented Databases for Java and C# which includes an extensive section on the architecture of Perst. This page also has some micro-benchmark comparisons including Perst, PP4J, BDBJE, and db40, but not GOODS. Perst comes out looking very good.

He even has a table of different releases of several DBMS’s, including GOODS and Perst, saying what changes were made in each minor release! But at no point does he say that he was involved in creating the Perst technology.

He mentions the web site perst.org. There’s nothing relevant there now, but Brewster’s Wayback machine shows that there used to be, starting in October, 2004. It’s quite clearly the same Perst. And the “Back to my home page” link is to Knizhnik’s home page. Aha, the smoking gun! By December, 2005, the site now mentions the dual license, and directs you to McObject LLC for a non-GPL commercial license. In 2006, the site changes to the McObject web site. McObject has several other embedded database products and was founded in 2001. This strongly suggests that McObject bought Perst from Knizhnik in 2006.

I joined the Yahoo “oodbms” group, and there’s Knizhnik, who is apparently maintaining FastDB, GigaBASE, and GOODS. He also wrote Dybase, based on the same kernel as GigaBASE. He announced Perst Lite in October, 2006. Postings on this group are sparse, mainly consisting of announcements of new minor releases of those three DBMS’s

The Tutorial
The documentation starts with a tutorial. Here are the high points, with my own comments in square brackets. My comparisons are mainly with PP4J, which likewise provides transparent Java objects. BDBJE works at a lower level of abstraction. [Update: BDBJE now has a transparent Java object feature, called DPL.] I don’t know enough about the low-level organization of BDBJE or of the current PP4J to make well-informed qualitative comparisons.

Perst claims to efficiently manages much more data than can fit in main memory. It has slightly different libraries for Java 1.1, 1.4, and 1.5, and J2ME. There is a base class called Persistent that you have to use for all persistent classes. [This is a drawback, due to Java's lack of multiple inheritance of implementation. PP4J does not have this restriction.] They explain a workaround in which you can copy some of the code of their Persistent.java class. [That sounds somewhat questionable from a modularity point of view, and doesn't help you for third-party libraries unless you want to mess with their sources.]

Files that hold databases can be stored compressed, encrypted, or as several physical files, in no file at all for in-memory use, and there’s an interface allowing you to add your own low level representation. Each database has a root object in the usual way. They use persistence-by-reachability [like PP4J]. There is a garbage collector for persistent objects. However, there is also explicit deletion; they correctly point out that this can lead to dangling pointers. [The fact that they have it at all suggests that the garbage collector is not always good enough.]

There are six ways to model relationships between objects. To their credit, they have a “(!)” after the word “six”. You can use arrays, but they explain the drawbacks to this. The Relation class is like a persistent ArrayList. The Link class is like Relation but it’s embedded instead of being its own object [huh?]. The IPersistentList interface has implementing classes that store a collection as a B-tree, which is good for large collections but has high overhead for small ones. Similarly, there is IPersistentSet. And finally there is a collection that automatically mutates from Link to a B-tree as the size gets larger. [PP4J, I believe, offers equivalents of the array, the list, and the set, and the list and set do the automatic mutation.]

How can they do transparent loading of objects, i.e. following pointers? They give you two choices, which can be set as a mode for each object: load everything reachable from the object, or make the programmer explicitly call the “load” method. They claim that this is usually sufficient, since your Java data structures usually consist of clusters of Java objects that are reachable from one head object, with no references between such clusters.

They assume that you always want to read in the whole cluster when you touch the head object [often true, but not always]. Also, when you modify an object, you must explicitly call the “modify” method, unless this is one of Perst’s library classes, whose methods call “modify” on themselves when needed. They say “Failure to invoke the modifymethod can result in unpredictable application behavior.”

[This is not what I would call "transparent"! PP4J is truly transparent, in that there is neither a "load" nor a "modify". PP4J always does these automatically. The Perst tutorial does not say what happens if you forget to call "load" when you were supposed to. Not all Java data follows their cluster model. PP4J depends for its transparency on the class postprocessor. As I recall, the postprocessor runs quickly enough that it doesn't seriously impact the total compile time. The only problem I had with it, as a user, was that it doesn't fit with the model assumed by the interactive development environments such as IntelliJ, requiring some inelegance in setting up your project.]

Perst has collections with indexes implemented as B-trees and allowing exact, range, and prefix searches. The programmer must explicitly insert and delete objects from indexes; if you make a change to some object that might affect any indexes, you have to remove the object from the index and re-insert it. [So you need to know in advance which things might ever be indexed, or pessimistically assume that they all are, and so remove and re-insert whenever the object is changed in any way. [I am pretty sure that PP4J does this automatically.] You can index on fields directly, or you can create your own index values (since you’re inserting explicitly) that could be any function of the indexed object. [That's very useful, and I cannot remember whether PP4J provides this.] Keys can be compound (several fields). They provide R-tree indexes and KD-tree indexes, useful for 2-D operations such as finding a point within certain constraints. They also provide Patricia Tries, bit indexes, and more. [Wow, how do they fit all that into such a small footprint?]

Transaction recovery is done with a shadow-object mechanism and can only do one transaction at a time. (So ACID really means AD.) [Like PP4J, at least in its first version.] The interaction of transaction semantics with threads, always a sticky issue, can be done in several ways, too extensive to go into here. [This looks very good.] Locks are multiple-reader single-writer. Locking is not transparent in basic Perst [Bad!], but there’s a package called “Continuous” which does provide transparency, although it’s not described in the tutorial. [So beginning users also have to remember to do explicit locking?] Two processes can access a single database by locking the whole database at the file system level; it works to have many readers.

There is a class called “Database” that provides semantics more like a conventional DBMS. It maintaints extents of classes. [Note: that means instances of the class can never be deallocated.] It can created indexes automatically based on Java annotations, but you still must do manual updates when the indexed object changes. It uses table-level locking. It has a query lanauge called JSQL, but it’s unlike SQL in that it returns objects rather than tuples, and does not support joins, nested selects, grouping, nor aggregate functions. You can “prepare” (pre-parse) JSQL queries, to improve performance if you use them many times, just as with most relational DBMS’s. A JSQL query is like a SQL “where” clause, and it uses whatever existing indexes are appropriate.

Schema evolution is automatic, and done per-object as the object is modified. It can’t handle renaming classes and fields, moving fields to a descendant or ancestor class, changing the class hierarchy, nor changing types of fields that aren’t convertible in the Java language. There’s an export/import facility that you’d use for those changes. You can physically compact the database files. Backup and restore are just like files [you have to back up the whole thing, not just changes, which is probably true of PP4J as well.] You can export to XML.

Perst supports automatic database replication. There’s one master, where all writes are performed, and any number of slaves, where reads can be performed. This lets you load-balance reads. It’s done at page granularity. You can specify whether it’s synchronous or asynchronous. You can add new slaves dynamically. For high-availability, it’s your responsibility to detect node failure and choose a new master node. [PP4J did not have this, the last time I looked.]

Recent Press Releases from McObject

Version 3.0 has new features. There is a full-text search, much smaller in footprint than Lucene and geared specifically to persistent classes. The .NET version supports LINQ. They list CA’s Wily Technology as a successful reference customer, for a real-time Java application.

Perst is used in Frost, which is a client for Freenet. “Frost is a newsgroup reader-like client application used to share encrypted messages, files and other information over Freenet without fear of censorship.” They switched from “the previous used SQL database” [it turns out to be something called McKoi] because its recovery was unreliable (leaving corrupt databases), Perst’s schema evolution is easier to use, the databases are smaller (because Perst can store strings in UTF-8 encoding), and because they could get better performance from Perst as the database size grew.

Perst has been verified as compatible with Google’s Android. They provide a benchmark program comparing performance of Perst against Android’s bundled SQLList. It’s a simple program that makes objects with one int field and one String field, and an index on each field. It inserts, looks up, etc. [It would be easy to recode it for PP4J or for B2B4J.]

The Download

The basic jar file is 530KB. “continuous” (see above) is another 57KB.

There’s more documentation, which goes into great detail about the internal implementation of how objects are represented and how atomic commit works. [It's extremely similar to PP4J. (The original version, anyway; it might have changed since.)]

There are other features, for which I could not find documentation. For intsance, each persistent class can have a “custom” allocator that you supply. You could use this to represent very large objects (BLOB/CLOB) by putting them in a separate file. In the database, you’d store the name of this file, and perhaps an offset or whatever. Also, there is an implementation of the Resource Description Framework (RDF, used by the Semantic Web to represent metadata).

There are lots of parameters that you can set from environment variables. I was not able to find documentation for these. The one that interests me most lets you control the behavior of the object cache. The default is a weak hash table with LRU behavior. Other possibilities are a weak table without the LRU, a strong hash table (if you don’t want to limit the object cache size), and a SoftHashTable which uses a Java “soft” hash table.

The code is clearly written except that it’s extremely short on comments.

Overall Evaluation

Perst is a lot like PP4J. To my mind, the most important difference is the degree of transparency. I greatly prefer PP4J’s approach of providing complete transparency, i.e. not requiring the use of methods such as load and modify. This has two advantages. First, your code is clearer and simpler if isn’t interrupted by all those calls to load and modify. Second, without transparency, it’s far too easy to forget to call load or modify, which would cause a bug, in some cases a bug that’s hard to find. Another problem is that the reference documentation is clearly incomplete and needs work. The tutorial, though, is quite clear and professionally-written, and very honest about the tradeoffs, pros, and cons of the product design. Personally, if you want to my respect, that’s how to do it!

However, it has a bunch of features and package that PP4J doesn’t (as far as I know).

I don’t know anything about the pricing of either product.

On the whole, for what it’s aiming for, Perst appears to be a very good, and a real competitor in this space.

This Is Your Brain On Music

Wednesday, April 30th, 2008

This Is Your Brain On Music, by Daniel J. Levitin, is the most exciting science book that I’ve read in a long time. It’s all about music: what is music, how do we perceive music, why do we care about music, and, primarily, what do we know about how the mind and brain react, process, and create music.

Some facts that I learned:

If I put electrodes in your visual cortext, and then I showed you a red tomato, there is no group of neurons and will cause my electrodes to turn red. But if I put electrodes in your auditory cortext and play a pure tone in your ears at 440 Hz, there are neurons in your auditory cortex that will fire at exactly that frequency, causing the electrode to emit electrical activity at 440 Hz — for pitch, what goes into the ear comes out of the brain! I find this amazing.

If you’re familiar with the phenomenon of restoration of the missing fundamental, in which you perceive the fundamental pitch if you are played only overtones of the pitch: it turns out that you can put in an electrode, play music with the fundamentals missing, and the electrode actually shows energy at the fundamental frequency! The very fact that we can know things like this is exciting.

Ordinary people, when asked to sing a song (of which there is one well-known canonical recording, such as most pop songs), will sing back the song at almost the exactly correct tempo! (They are accurate within 4%, which is as good as most people can perceive anyway.) They also often get the key right, even though few people have “perfect pitch” per se. I would never have guessed this.

The brain stem and the dorsal cochlear nucleus — structures so primitive that all vertebrates have them — can distinguish between [musical] consonance and dissonance; this distinction happens before the higher level, human brain region — the cortex — gets involved.

The book is extremely readable and fun. It teaches you all the music theory you need to know. In fact, his basic music theory section is the best quick introduction to music theory I’ve ever read. The author has been a professional producer, so he knows a lot about how modern music recordings are made. He currently runs the Laboratory for Musical Perception, Cognition, and Expertise at McGill Unversity, and has published a lot in serious scientific journals. That’s a combination of expertise that may be unique. He knows several well-known musicians and quotes from them; what Stevie Wonder and Joni Mitchell have to say is quite interesting. The book is available in trade paperback for only $15 US.

Nerdcore Rising – A Documentary about MC Frontalot

Monday, April 28th, 2008

Last night, I saw a screening of a new independent film called Nerdcore Rising.  The filmmakers follow a country-wide tour of a band called MC Frontalot, who are leaders in the genre of nerdcore hip hop, also known as geeksta rap.  (That’s the name of the rapper himself, as well as the band, I think.) As I’m sure you can guess, it’s a cross between actual musical talent and self-deprecating ironic humor. Afterwards, we went to a party at Mantra Restaurant in Boston, which was turned into a nightclub for the occasion.  ITA Software, my employer, sponsored the event.  (Of course, for us it was a recruiting event, at least putatively.  Come work here!)

The director, Negin Farsad, and many members of her team, were there and talked about the film.  At the end, the members of the band walked out and took questions.  Apparently there are quite a few nerdcore hip hop bands; estimates range from 50 to hundreds.

I’m not a rap fan at all, and I admit that I had trouble following the lyrics, which went by very quickly.  And at the nightclub, everything was much too loud, and the vocals were mixed much too low (as always seems to be the case in these venues).  Despite all that, I had a very good time.

Despite Farsad’s protestations that she only just learned how to make movies, it’s quite professionally done.  The editing is great, and keeps the movie moving right along.  It’s exciting and funny.  If you get a chance, check it out.  There’s plenty of music (with lyrics, yay!)  at MC Frontalot’s web site.