Archive for the ‘Storage’ Category

NoSQL Storage Systems Never Violate ACID. Never? Well, Hardly Ever!

Tuesday, September 7th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Everybody agrees that the new “NoSQL” storage systems “aren’t ACID”, or “don’t have transactions”.  This is true <i>in a sense</i>, but without knowing the sense, it doesn’t tell you much.

In one sense, they <i>do</i> have transactions that are limited to having one operation per transaction.  One operation could mean reading, writing, incrementing, or doubling the value associated with a particular key.  For example, look at an “insert” operation in a key/value store.  An operations acts on only one data object.  Are these single-operation transactions ACID?  Let’s check each criterion:

A means “atomic”: either all the operations happen, or none of them happens.  Well, there’s only one operation.  The key-value store <i>does</i> guarantee that either the insert happens, or it doesn’t.  So the transaction atomic.

C means “consistent”.  In relational database systems, people use this to mean that various interesting consistency guarantees are maintained.  But here, we don’t have to worry about such things as referential integrity, since there are no references to have integrity; that is, there are no foreign keys.  So it’s consistent.

I means “isolated”: concurrency is never seen by the application.  The system behaves as if each operation happened at a particular, distinct moment in time.  The key-value stores all make this guarantee.

D means “durable”: before the application is told that the transaction has been completed successfully (i.e. committed), any side-effects it does are in stable storage so that if a node stops (such as a crash of a process or a whole node) won’t lose the results of the side-effects.  Here, a transaction is only one operation, but that doesn’t change anything: the system does provide “durability”.  (Some systems might cheat by not actually forcing data to stable storage, but we’re not talking about those.)

So it appears to be ACID!  OK, something has <i>got</i> to be wrong here, right?

Right.  Where I tried to pull the wool over your eyes is the definition of “C”.  “C” doesn’t just mean conforming to the databases integrity constraints.  It means that the system returns the correct answer! That is, response to any operation is consistent with some state that the database could be in.  There’s more than one such state when there are concurrent operations going on, which might be ordered in more than one way, depending on how the concurrency system works.  So it’s clearer to think of “C” as meaning “correct”.  (In the famous Gilbert and Lynch paper that “proves the CAP theorem”, that’s what they mean by “C”.)

The “NoSQL” storage systems are guaranteed return the correct answer <i>only</i>if there are no partitions in the network.  But if there are (or were, e.g. at write time) partitions, they can return things like “two replicas say the value is X, but another replica says that the answer is Y”, and the application has to try to make sense of and cope with that.  That is <i>not</i> “C”.  This is usually called “eventually consistency”: if the partitions were to eventually heal and the system deferred accepting new operations until all the in-progress operations finished, and something went over the whole database to fix up any inconsistencies that happened during writes, then the system would become fully consistent, and would be behave correctly until the next partition.

that there are at least two nodes that cannot send messages between each other.  It’s important to know that if a node in your your system is down, that’s considered a partition: it’s as if this node were disconnected from the network.

The “NoSQL” systems are ACID, as long as you accept that a transaction can only perform one operation, in the sense that the only thing that gets in the way of being ACID is when there are network partitions and the system is called upon to perform operations while the partition is still there.

“Partition” is a somewhat slippery concept that I will examine in an upcoming separate essay.  But the basic ides is that a it means that there are at least two nodes that cannot send messages between them.  It’s important to know that if a node in your your system is down, that’s considered a partition: it’s as if this node were disconnected from the network.

This also shows that the name “NoSQL” doesn’t explain everything that’s important about these systems.  But you can’t pack a whole lot into a short, punchy name, so I’m not really complaining.  ( do the same thing with the names of my blog essays; <i>mea culpa<i>.  You just have to keep in mind that the lack of SQL is not the only important thing.

VoltDB versus NoSQL

Sunday, July 11th, 2010
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Mike Stonebraker is the co-founder and CTO of VoltDB, which makes a novel on-line transaction processing (OLTP) relational database management system (RDBMS). He recently gave a talk entitled “VoltDB Decapitates Six SQL Urban Myths”. You can read the slides here. Much of the talk is a reply to the claims of the community building data stores often referred to as NoSQL data stores.

Todd Hoff of HighScalability has written an excellent commentary on the talk. If you want to understand what’s going on with VoltDB, you can’t do better than to read this (including the commentary, with some replies from VoltDB). I have a bit to add.

Benchmarking

Dr. Stonebraker’s talk includes benchmark results, which VoltDB ran much faster than MySQL and , a well-known NoSQL data store.

Over many years, I have found that what nearly everybody wants is a predictive “single number” that says how much faster one DBMS is than another. But applications differ hugely in their workloads, and measured speed depends tremendously on using the DBMS in the best way, including layout, clustering, indexing, partitioning, and all kinds of options, such as whether transactions are immediately made durable or not. Saying that one DBMS is “N times faster” than another DBMS is very misleading. But everyone wants the magic number, and are too quick to assume that the result of one benchmark predicts speed in all situations.

One must take into account that the VoltDB engineers wrote these micro-benchmarks, and ran them on a very specific workload, knowing what they were trying to prove. I do appreciate that they made a good-faith attempt to be fair, based on John Hugg’s comments above. And I can vouch that John is a very smart guy, and I believe all that he says in his comments above. Nevertheless, they did not bring in experts in the other systems who would could tune them optimally. Different benchmarks might be less flattering.

The old argument about assembly language versus high-level languages would be analogous if RDBMS optimizers worked as well as C/Java/etc compilers. SQL is supposed to be declarative: you just ask for what you want, and the RDBMS figures out the best way to get it. But my experience, and what my friends tell me, is that the optimizers in some popular RDBMS’s (especially Oracle) frequently make bad choices, and picking the wrong query strategy can slow things down by huge factors. So the developers are forced to override the optimizer with “hints”. It’s been over 30 years, and still the optimizers fail. Maybe it’s time to declare the experiment a failure. (This may not be an issue for VoltDB, as the SQL might be always be very simple or something.)

Stored Procedures

He’s right that performance can be hurt by too many round trips to the DBMS. But Oracle users have know for a long time that you have to use stored procedures to get high performance; this is nothing new. When you do this with Oracle, you end up with lots of PL/SQL code. Most of your developers can’t understand it, and it’s a proprietary language so you’re “locked in” to Oracle (it’s very hard to switch to a different DBMS).

It’s one thing to provide stored procedures as a way to improve performance. But VoltDB requires you to use stored procedures, and each interaction with VoltDB is a transaction. Any application that mixes database access with other operations that must be done on the client side cannot use VoltDB. The application has to be written in the VoltDB manner, from the beginning. This is like “lock-in” in some ways.

More about the VoltDB presentation

Todd says: “In contrast, the VoltDB model seems from a different age: a small number of highly tended nodes in a centralized location.” I don’t think this is right. For disaster recovery (e.g. blackouts), you need a replica far away; this has always been an integral part of VoltDB’s justification for not logging to disk. And then you have to worry about network partitions over a WAN. WAN’s are not yet supported in VoltDB.

I find Todd’s point about Amazon’s Dynamo very compelling: why would Amazon do so much work if partitions are so rare? At Amazon scale, partitions must be frequent enough to justify all this work. Not all VoltDB customers will be operating at that scale, but John Hugg has said that it’s designed for “Internet scale”. Dr. Stonebraker is right that there’s no substitute for actual measurement of how likely partition is.

Putting the burden on application programmers

Serious production databases are usually manged by database experts/administrators, who decide where to replicate what, whether and how to partition tables (across nodes), and so on.

But with VoltDB, the application developers have to understand a lot about this. For example, they need to know whether a procedure is single-partitioned, so they can assert that in the code. So they have to know about sharding, where replicas are, and so on. It makes the application brittle insofar as changes by the database administrators could break those assertions.

For example, a VoltDB engineer explained to a customer: “The goal of VoltDB is to optimize single-partition transactions and part of the responsibility for that falls on the application developer. You must write the queries to operate properly within a single partition and then declare the procedure to be single-partitioned. [...] Today, VoltDB does not verify that the SQL queries within a single-partitioned procedure are actually single-partitioned.” Another VoltDB engineer said: “The vast majority (almost 100%) of your stored procedure invocations must be single-partition for VoltDB to be useful to you.”

Different “NoSQL” systems also put such burdens on application programmers to greater or lesser degrees, as well. RDBMS’s have traditionally boasted that they hide these issues from application programmers. VoltDB uses SQL, but what it provides is very different from the original concept of the relational model.

What is a “SQL” database system?

You can see more of this in their “VoltDB do’s and don’ts list” Perhaps the most important point is the first “Don’t”: “Don’t use ad hoc SQL queries as part of a production application.” Dr. Stonebraker’s talk is very much a defense of using SQL for OLTP, rather than the “NoSQL” models such as key-value stores. But what does the restriction against “ad hoc” queries mean?

The original fundamental claim of relational DBMS’s (as opposed to the previous generation, the CODASYL-type DBMS’s) is that you don’t have know the access pattern; you just say what you want in SQL, and the DBMS figures out how to do it. Applications keep working even if there are changes in the storage layout, indexing, and whatever else the DBMS uses. But, as a VoltDB engineer said, “Part of VoltDB’s underlying premise is that workloads are known in advance.”

Even though VoltDB uses SQL, maybe it isn’t as far from the “NoSQL” storage engines as one might think!

Using Solid State Disks on Linux

Saturday, February 21st, 2009
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Solid-state disks (SSDs) are getting less expensive, faster, and larger.  I just bought a lightweight laptop with 128GB of SSD instead of a disk. Just to see what I’d find out, I poked around on the web looking for information how to use SSD’s under Linux.  Keep in mind that I am not an expert on SSD’s, nor on Linux!  Bearing that in mind, here’s what I found:

Tuning Linux for SSDs

Here a quick summary of Tom Bryer’s “Four Tweaks for Using Linux with Solid State Drives” (Sept 2008)

If you’re using Linux with SSD’s, it is recommended to use the noatime option to turn off writing the “last accessed time” attribute to files.  This avoids writes, increasing the lifetime of the SSD.  (As root, edit /etc/fstab and change “relatime” to “noatime” on SSD partitions.  This might only apply to ext3.)

You can create a tmpfs partition (in RAM) and make Firefox use it for its cache, to reduce disk writes.  Edit the file /etc/fstab and add:

tmpfs /tmp tmpfs defaults,noatime,mode=1777 0 0

Then, in Firefox, open about:config, right click in an open area, create a new string value called:

browser.cache.disk.parent_directory

and set it to /tmp.

If you write a large file to the disk, Linux will stop any other application’s attempts to write, potentially for a long time.  To greatly reduce the pause, change the I/O scheduler for SSD’s. Do:

cat /sys/block/sda/queue/scheduler

to get the current scheduler for a disk (sda, in this case) and to see the alternative options. You’ll probably have four options, the one in brackets is currently being used by the disk specified in the previous command:

noop anticipatory deadline [cfq]

Now do (as root):

echo deadline > /sys/block/sda/queue/scheduler

File Systems for SSDs

What’s a good Linux file system to use for SSD’s?  A lot of people have asked this on the web and gotten very few straight answers.  There is jffs2 but everybody seems to think it’s lousy.  Some people think that ext2 is considered better than ext3, which is a journaling file system that does more writes.  However, journaling keeps the file systems’ metadata consistent after a crash, so it’s quite valuable. Surely there’s a lot more to say that this, but I wasn’t able to find it.

SanDisk has announced ExtremeFFS.  It looks like this is not a Linux file system, but rather the hardware acts like a disk.  If so, one could take advantage of this technology from non-Linux machines.

Samsung says that ExtremeFFS  uses a non-blocking architecture in which all of the NAND channels of the SSD can behave independently.  It can read and write at the same time.  They also claim that it can speed up random writes by 100x!   How they do it  is explained in this article by Chris Mellor.  They avoid the need for erases in a lot of cases.  Also there is a garbage collector!

I found a comment saying that “this sounds like what Fusion-io is doing on the ioDrive.”  Fusion-io makes very high-speed SSD’s.

Also

Although one person points out that your SSD may outlive your laptop, or you can replace it with a larger, cheaper, faster one that will be around at that time (assuming that you can get your data off before it’s too late).  But avoiding journaling is also good for speed, not just longevity.

It’s good to align your file system on an erase-block boundary.  Especially if you’re using RAID, so that a whole stripe can be copied efficiently.  You want your partition aligned on a 128K boundary.  Theodore Tso’s blog item provides vast technical detail.

Please share what you’ve found out about these things.  Thanks!

Reblog this post [with Zemanta]