What Is a “Better” Database System?

November 7th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

What makes one database management system better than another?

Let’s compare two database system, A and B, that use the same data model (such as “relational”) and the same transaction types (such as “ACID”, or ACID with some reduced isolation level. How do we decide whether A is better than B?

These days, the answer is, which one has better latency or throughput for a given scenario. A scenario is defined by the contents of the database and the particular queries (any kind of request) it is given. If you read the marketing literature for any commercial database system, that’s what it talks about. The scenarios could include new datatypes, streaming, and so on. But the metric that measures “better” is still speed.

But is this right for all of today’s needs?

An exciting paper by Daniela Florescu and Donald Kossmann, entitled Rethinking Cost and Performance in Database Systems, suggests a different way of looking at things: given a set of performance requirements, and consistency needs, what is the least expensive database system we can build that meets those parameters. (You can also find the paper here.)

For example, sometimes latency (response time) only has to be fast enough that a human being won’t be bothered by having to wait that much longer. There’s no point in trying to get the latency “as low as possible” when 100 milliseconds is just fine.

The paper was published in SIGMOD Record, a journal sent to ACM members who are in SIGMOD (the special interest group for Management of Data). These are typically researchers, or people who follow the research closely. The paper suggests to the entire database research community that they should consider changing their whole orientation in this way. If you are working on building a database management system, or interested in looking at some of the new databases and data stores, this new viewpoint is illuminating. I recommend the paper highly.

ACID in Theory and Practice

November 5th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

The new so-called NoSQL data stores have been criticized, often by the traditional database community, because they sacrifice “ACID transactions”. Is this fair? How much does it matter? I’ll briefly go over what ACID transactions are and what they’re for, and then look at how they’re used, or not.

ACID

A “transaction” works this like: a thread (locus of control) does the following steps:

  • A “begin transaction” operation
  • Arbitrary computation, which can include:
    • examining data
    • modifying data
  • Either a “commit transaction” operation, or an “abort transaction” operation.

If the “commit transaction” operation completes (i.e. returns to its caller in the thread), the transaction is said to have committed. If the thread does an “abort transaction”, or if the thread halts (the thread gets an unhanded exception, the thread is killed, the process is killed, the hardware crashes), the transaction is said to have “aborted”.

(In some systems, the “begin transaction” is implicit when the previous transaction completes; it doesn’t matter.)

Ideally, a transaction has four properties, usually described with the helpful mnemonic “ACID”:

Atomic: If a transaction modifies the data and the transaction commits, all of the changes are performed; if the transaction aborts, none of them happens.

Consistent: There is some predicate on the data, called “consistency”. If the data is in a consistent state when the transaction has first started (before it performs any side-effects), then it is consistent after the transaction finishes. (This is trivially true if the transaction aborts.)

Isolated: Although many threads of control might be examining or modifying the data concurrently (interleaved in time), everything behaves as if they were sequential, i.e one at a time in some order.

Durable: If the transaction commits, any modifications it has made are “durable”, which means that they take effect even if there is a halt.

“Consistency” is hard to define. What is really means it that the data in the database is an accurate representation of the real world (for example, account X has $A and account B has $B), and that the transactions that moved the database state from a before-state to an after-state are consistent with real world operations (e.g. money has been withdrawn from account X and deposited in account Y).

Unfortunately, there isn’t any real way to check and enforce this. So what happens depends a lot on the application. Often what people mean by “consistency” is that certain invariants are met. Some database systems provide support for adding checks for these invariants, called “integrity constraint”. Meeting these constraints is necessary but not, in general, sufficient for consistency, but that’s often what people mean by “consistency”. Mostly people don’t pay much attention to “C”, anyway.

If a data storage system is both Atomic and Durable together, then modifications made by a committed transaction are all performed on the database, even in the face of a halting failure. This plus Isolation presents the application with an abstraction that’s very clean and easy to deal with.

Most important, ACID is entirely independent of the application. The concerns of the application are entirely separated from the concerns of failure and interleaving. This separation of concerns makes things much simpler, and reducing complexity is of great value.

Isolation in Theory and Practice

Writing an application is easy with isolation, because the programmer can ignore concurrency. But do people really use database systems this way? When we look around, we find the concept of an “isolation level”, in which an application can decide how much isolation it wants. Don’t they all want total isolation? Yes, but there’s a big problem: total isolation hurts performance severely in so many cases that it’s rarely used! If you don’t believe me, consider the following.

Thomas Kyte has written widely about Oracle DB, especially about how its transactions work. His book, “Expert Oracle Database Administration”, was recommended to me by a skilled Oracle database administrator; Kyle is highly respected. Although Oracle DB can do ACID transactions, the book strongly recommends against using them. Oracle DB has more than one “isolation level”. The strongest, READ REPEATABLE, provides ACID transactions. (Almost. If you care about the “phantom read” issue, you don’t need me to tell you about this stuff.) Instead, he recommends that you use the READ COMMITTED isolation level. He says that it is “the most commonly used isolation level” and that “it is rare to see a different isolation level used.”

When using READ COMMITTED, you are not guaranteed to get “repeatable reads”. That is, during the course of a transaction, you might read a value, and later read the same value and get a different result back, because of writes by concurrent transactions! Remember that a read is often not a direct request to read just one column of one row; it’s often part of a more general SQL query. You might not even know that there is some data that two SQL queries both read. This is not what I’d call the “I” in “ACID”. The concurrency, rather than being cleanly separated from the application, is now exposed to the application. The application writer has to know that reads are not repeatable and take that into account, which makes his or her life harder.

This isn’t specific to Oracle. In fact, it’s so pervasive in relational databases that it’s even part of the SQL standard. Isolation levels are so important that they aren’t just an implementation-specific hack. The official SQL standard defines several reduced levels of isolation.

Here’s another story about not using ACID. I and the rest of the ObjectStore team at Object Design once had the great opportunity to talk with some of the most renowned database experts in the world, at IBM’s Alamaden Research Center. These are the people who designed one of the earliest relational database systems (System/R), and continued to do groundbreaking work, which can you can read in many excellent papers they have published, many of which I had read. The group included Don Haderly, C. Mohan, Bruce Lindsay, and others, If these people don’t know about transactions, nobody does.

When they heard us say that ObjectStore provided real ACID transactions, they were surprised, and explained to us that nobody really uses those. They said you mustn’t do that, or your database system will be too slow.

They said, what our relational database applications use is “cursor-stability isolation”. Here’s how that one works. In a relational database, you typically perform a query, and get back a sequence of rows (a.k.a. tuples). The application iterates over the tuples, with a “cursor” to keep track of where in the sequence it’s up to. With “cursor-stable” isolation, when the cursor moves to a row, that row is locked. When it moves the cursor to the next row, the old row is unlocked and the next row is locked. At the end, the last row is unlocked and the transaction ends.

I was very surprised. While the application is working with one row, all the other rows could change out from under it. If you were trying to sum up some column (attribute) of each row, you might not get a consistent snapshot of the database.

For example, suppose each row represented a bank account, and an application A wants to transfer $100 from account A to account B. Concurrently, application B wants to sum up the total amount in all accounts. B should get the same answer no matter how it is interleaved with A. B should not see an inconsistent state where the debit has been done and the credit has not. I asked how an application e deal with such confusing behavior. This is a lot like the Oracle situation: reads are not repeatable.

I was very surprised: how can the application writers be expected to deal with this lack of isolation? The answer went something like this:

Summing up a column was really done in one SQL transaction using a SUM aggregate, and in that case the problem does not arise, because within a single SQL query, you do get isolated behavior. (This is true in Oracle as well.) Many common simple cases can be handled using SQL aggregation operators.

Yes, it’s true that if you have more than one query in your transaction, the application programmer does have to be aware of possible effects of interleaving. However, in real life (they said), most transactions are simple enough that it’s not so hard to reason about the effects of reduced isolation, and sometimes you can just ignore them.

To me, this was not a very satisfying answer. It’s like saying, well, it works in simple cases and when you’re lucky.

In ObjectStore, there were data structures much more complicated than tables. Indeed, ObjectStore could store anything that you can express in your programming languages (C++ or Java). We didn’t see any way to something analogous. We got away with using ACID because the sweet spot for ObjectStore wasn’t applications doing fine-grained interleaving.

Who Casts the First Stone?

The ACID transaction abstraction provides an excellent separation of concerns. It’s true that the NoSQL stores, with their “eventual consistency” properties, or their “return many possibly-different values” API’s, force the application to live with weaker guarantees than ACID. But so do the real relational database systems. Academic papers or commercial white papers that criticize the NoSQL data stores for not providing ACID should be fair: in the real world, nobody who cares about fine-grained concurrency is providing ACID guarantees.

Addendum of Nov 8, 2010

One of ITA’s very knowlegable Oracle experts pointed out some things that some issues that I should have discussed.

I should have mentioned that using Oracle’s “read committed” isolation, you do get repeatable reads within a single SQL query. When writing software that uses relational databases, it’s good to do as much as you can within a single query, rather than doing many queries as part of an imperative flow of control. All other things being equal, declarative code is better than imperative code. It is much easier for a person to reason about, which makes code clearer and easier to understand. Also, it makes code easier for a computer to understand. Writing an optimizer for imperative code is harder than writing one for declarative code.

Our expert tells me that sometimes programmers, who are generally trained in, and experienced with, imperative coding, will sometimes write programs that do one query after another, when it could have been done in a single query. To be sure, to do it in a single query can require you to learn morea bout SQL. But if you’re using a relational database that uses SQL, you really ought to learn that stuff. If you are using a tool, you should learn to use it.

Of course, not all situations allow you to take a transaction and make it only need one SQL statement. But if you can do that, you get transaction guarantees that are much closer to ACID. (It’s still not precisely ACID due to the so-called “phantom” scenario, but I will cut Oracle slack for that since it’s hard to solve in their architecture.)

However, I’ll add that one of the criticisms of the “NoSQL” data stores that the relational experts make is that they can only do one operation in a query. While that is true, and it is a disadvantage, it’s also true that if you use Oracle, your transaction has better properties if it only performs one operation (query) per transaction. That’s not an apples-to-apples comparison (traditional RDBMS’s are capable of doing multi-query transactions, but it’s something to think about.

OpenSQL Boston 2010 Takes Place This Weekend

October 13th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

OpenSQL Camp Boston happens this weekend. It’s an unConference, which means anybody can give a talk and anybody can listen. There arew usually several parallel tracks. This is an unConference about open source databases, both relational and non-relational databases, database alternatives like “NoSQL stores”, and so on. There will be people from PostgreSQL, MySQL, MariaDB, VoltDB, Rackspace, InfoBright, BerkeleyDB, MIT, and others.

The events are:

  • Friday Oct 15, at 6pm: social event at WorkBar Boston, 711 Atlantic Ave, Boston, MA
  • Saturday, Oct 16: unConference at the Stata Center
  • Saturday, Oct 17: more unConference at the Stata Center, ending 6:00 p.m.

Click here for the full info.

More About Data Models

October 8th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Reading the comments on my earlier posts, as well as other posts in other blogs, shows me that there is still some confusion about the relational model. I’d like to clear some of this up.

I’m going to stop talking about the “network” or “CODASYL” model. I don’t know its details and history. I’ll just compare the relational model with the hierarchical model, as seen in, e.g. IBM’s IMS.

Object-Oriented Database Systems

It’s ironic that I’m being looked at as some kind of “relational fanboy”. The reason that I first learned about database systems was to build an object-oriented database system (for and in Common Lisp,) at Symbolics. It was called Statice and the first version was completed in 1988.

After that, Symbolics refused to let us port Statice to other hardware. Meanwhile, the object-oriented database system idea was picking up a lot of “traction”. A group of seven of us formed Object Design, where we built the best and most successful (if I may say so) object-oriented database system (this time in C++, later in Java).

I am not a relational model fanboy in the least! The point of my recent blog postings is to explain why Object Design kept getting pelted with rotten tomatoes by real the relational database fanboys. I call them “fanboys” because their devotion to the relational model was often blind. Either they did not understand what problem the relational model was intended to solve, or they didn’t understand why that wasn’t the problem we were addressing.

Despite the fact that I’m an object-oriented database guy through and through, I can still understand the purpose and benefits of the relational model Because these issues are arising in the “NoSQL” world, I want to try to lay out an explanation, especially to show what a data model is, and how it’s different from transaction and scalability issues.

What a Data Model Is

The a data model is a way to model some facts about the real world that you care about. The best book I know of about modelling data is “Data and Reality”, by William Kent, then of Hewlett-Packard. It was written in 1978 but it’s so fundamental that nothing has changed. Quite a lot of real-world facts can be modeled with normalized relational models. People can debate over whether such a representation is natural, unnatural, easy or hard to think about. If you want to make a whole new data model, read Kent’s book before you get too far.

The model you choose to use says absolutely nothing per se about indexes, how much time certain operations take, transactions, eventual consistency, or cloud computing. Implementing a database system based on a certain data model is another thing entirely, which I’ll be discussing in later posts.

Schema Evolution

Does the relational model, in and of itself, say anything about taking existing data, and changing the relational schema so that the facts in the original database are faithfully represented in the new database? I’m not 100% sure, but I don’t think so. I think the relational model does not talk about schema evolution, just about how to represent things as of now.

(Believe me, I know about schema evolution in relational database systems. I know a lot more than I want to know. My co-workers have been through a lot of pain. As my son says, “Sad face!” To be fair, it wasn’t easy in ObjectStore either: we did it but it was hard.)

Whether a schema is “sparse” can mean one of two things. First, it can mean that the model is relational but, as an implementation convenience, we don’t store certain things that we don’t need in certain places, because the columns and rows can be defaulted, or compressed out. In that sense, it’s just an implementation technique that has nothing to do with the model.

Second, they can mean that there is a flexible schema. For example, an individual document in MongoDB can have a new attribute name that was never used anywhere else. That is not relational.

C. J. Date’s Book Title

In the previous blog posts, I did not mean to imply that C. J. Date thinks that all database systems use the relational model. All I was trying to do was point out, as a sort of footnote, that his book does not have the word “Relational” in its title. He just takes it for granted. In my earlier post, Why Relational Databases, Anyway?, my point was that many people take for granted that the relational model is always right for every single data storage need, and anything that is not relational (such as object-oriented database systems) are inherently misguided. That is not the case.

Hierarchical versus Relational: Anticipating New Queries

Consider the standard example of Company-Division-Employee, where each company has many divisions and each division has many employees. In a normalized relational representation, there is a table called “Company” with one row per for each company being modeled. There is a “Division” table with one row per division being modeled, and a column that is a foreign key to the row of Company that represents this division’s company. The next level, Employee, is analogous.

The data model is not hierarchical. You could say that the particular schema, when looked at in one particular way, acts hierarchically. But the schema I’m describing, in and of itself, viewed purely from the relational model (ignoring implementation details) can model things that don’t look hierarchical whatsoever.

In the old-style hierarchical model, in order to say “add up the salaries of every employee”, your query has to start by saying “for all the companies…”. In the relational model, you don’t do that. It’s much simpler. You just go to the Employee table, and add up the values of the Salary column. You can ignore that hierarchy entirely! That’s the key benefit of the relational model.

Data Independence, Again

This is what I was saying in the previous blog post: The Problem That Relational Databases Solve. In this example, if we suddenly find that we need to get the sum of all the employee salaries, the data model makes that easy. The relational model, with normalization, is doing its best to anticipate queries that might come with a new application, and make them easy and natural. (Remember, the mind set is that applications come and go, but data stays for a long time.) That is what data independence is for.

If you or I had to make a data model with these properties, and we had never seen the relational model before, would we have come up with the model, just as E. F. Codd did? I don’t know about you, but I doubt I would have. I think it’s a novel and brilliant idea, to solve the particular problem that it was designed to solve.

Does NoSQL Mean We Don’t Like the Relational Model?

The fact that the “NoSQL” database systems are flourishing does not, in and of itself, mean that people are intentionally turning away from the relational model. After all, if someone primarily wanted a different model, they could just build a conventional centralized DBMS that that model. This would have nothing to do with scaling and transactions. ObjectStore and many other object-oriented databases did that, and those are not the only examples.

It’s one thing to say that giving up the relational model is a cost that is more than balanced by the benefits of a new type data store. It’s quite another to say that the relational model itself is undesirable in all cases. One could feel either way, but I think it’s helpful to distinguish these two issues.

Coming Soon

What about using the actual DBMS products that you can buy? What are the consequences of denormalizing, of using BLOBs/CLOBs, and so on?

The Problem That Relational Databases Solve

October 4th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

As I said last time, “data independence” is a clean separation between applications and data. What problem does it solve, and how does it solve it?

In my previous post, I talked about the people who take the relational model for granted. Where did it first come from, and why?

(Most of this essay is taken or paraphrased from perhaps the best expositor of the relational concept, C. J. Date. I am grateful to Prof. Michael Stonebraker for his comments on an earlier draft. As always, any errors are mine alone.)

Data Independence

For a large enterprise, there is a very large body of crucial information. These are the “crown jewels” of the information technology part of the company. This information lasts for the whole lifetime of the enterprise. But applications come and go, like migrating birds. The next application to come along might want access data in a different way, for important reasons. The structure of the database structure must adapt well to these new and changing demands.

With the older styles of data organization (called “network” or “CODASYL”, roughly speaking), sometimes the new application could not be done efficiently. Many times, for all practical purposes, it was impossible to write the application with acceptable performance. You can find the details of this in many books, but to give just one analogy: suppose you have a program with nested loops. In many cases (not 2D arrays), it’s pretty obvious which loop ought to be on the outside. Well, imagine if you forced to do it the other way, even if it made the program very much slower. And that’s just one example.

To solve this, we want data organization that can do two things. First, give every application a view of the database that doesn’t change over time, so that the application keeps working. Second, have a way to change the physical organization of the data without changing any of the software that uses the database system, which may be needed to make the new applications faster without hurting the old ones, or not hurting enough that it matters much. This is called “data independence.”

The Relational Model

A novel and effective solution to data independence, the “relational”, was created by E. F. Codd, in 1970. By representing data in relations, in normalized form, you can solve both of the above problems. I won’t go over all that here; I recommend “An Introduction to Database Systems” by C. J. Date.

(By the way, notice that the name of the book isn’t “… to Relational Database Systems,” even though that’s what the book is. Why bother with a superlative adjective, when “everybody knows” that all database systems, other than ancient ones, are relational?)

The relational model, as an abstract concept, is an excellent and brilliant solution to the data independence problem. Later we’ll see that that is not the only problem for which people want to store data. But in the next post, I’ll look into how well actual relational database systems implement the concept.

Postscript: I am only talking here about the way data is modeled. I’ll talk about transaction issues later.)