<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dan Weinreb's blog &#187; Database</title>
	<atom:link href="http://danweinreb.org/blog/category/database/feed" rel="self" type="application/rss+xml" />
	<link>http://danweinreb.org/blog</link>
	<description>Software and Innovation</description>
	<lastBuildDate>Mon, 05 Sep 2011 20:12:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Improving the PACELC Taxonomy</title>
		<link>http://danweinreb.org/blog/improving-the-pacelc-taxonomy</link>
		<comments>http://danweinreb.org/blog/improving-the-pacelc-taxonomy#comments</comments>
		<pubDate>Wed, 12 Jan 2011 20:51:47 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Concurrency]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[High Availabilty]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=607</guid>
		<description><![CDATA[Daniel Abadi of Yale published a blog article last April criticizing the characterization of distributed storage systems using &#8220;you only get two of C, A, and P&#8221;. He proposed a new taxonomy with the acronym PACELC, to be read &#8220;If P, then trade off A for C; if E, trade off L for [...]]]></description>
			<content:encoded><![CDATA[<p>Daniel Abadi of Yale published a <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">blog article last April</a> criticizing the characterization of distributed storage systems using &#8220;you only get two of C, A, and P&#8221;.  He proposed a new taxonomy with the acronym PACELC, to be read &#8220;If P, then trade off A for C; if E, trade off L for C&#8221;, with this meaning:</p>
<ul>
<li>When there is a partition, how does the system trade off between:
<ul>
<li>Availability and</li>
<li>Consistency</li>
</ul>
</li>
<li>Else, when there are no partitions, how does the system trade off between:
<ul>
<li>Latency and</li>
<li>Consistency</li>
</ul>
</li>
</ul>
<p>For example, a system characterized as PA/EL means that in the face of partitions, it favors availability over consistency, and if everything&#8217;s working, it favors low latency over consistency.</p>
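<p>To make the notation concrete, here is a minimal sketch that expands a label such as PA/EL into words.  The function name <code>describe_pacelc</code> and the wording it produces are my own illustration, not part of Prof. Abadi's article.</p>

```python
# Hypothetical helper: expand a PACELC label such as "PA/EL" into words.
CHOICES = {"A": "availability", "L": "low latency", "C": "consistency"}

def describe_pacelc(label: str) -> str:
    # A label has two halves: "P?" (partition case) and "E?" (else case).
    partition_part, else_part = label.upper().split("/")
    if len(partition_part) != 2 or len(else_part) != 2:
        raise ValueError(f"not a PACELC label: {label!r}")
    if partition_part[0] != "P" or else_part[0] != "E":
        raise ValueError(f"not a PACELC label: {label!r}")
    p_choice, e_choice = partition_part[1], else_part[1]
    if p_choice not in ("A", "C") or e_choice not in ("L", "C"):
        raise ValueError(f"not a PACELC label: {label!r}")
    # Whatever is not favored is what gets traded away.
    p_other = "consistency" if p_choice == "A" else "availability"
    e_other = "consistency" if e_choice == "L" else "low latency"
    return (f"under partition, favors {CHOICES[p_choice]} over {p_other}; "
            f"otherwise, favors {CHOICES[e_choice]} over {e_other}")

print(describe_pacelc("PA/EL"))
```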
<p>I think this is moving very much in the right direction, and I hope I can contribute and help develop these ideas a bit.</p>
<p><b>Problems with the &#8220;Proof of the CAP theorem&#8221;</b></p>
<p>The &#8220;CAP&#8221; characterization has a lot of problems.  It is especially poorly applied, if not actually misused, when someone trots out the &#8220;proof of the CAP theorem&#8221; to show how they were forced into a tradeoff.  While the proof is correct, what it proves is too crude to model what we really care about.</p>
<p>I discussed the proof in an <a href="http://danweinreb.org/blog/what-does-the-proof-of-the-cap-theorem-mean">earlier post.</a>  In the proof, each attribute is problematic:</p>
<ol>
<li>&#8220;Consistency&#8221; means the behavior that would have happened on a single server that never crashed and did each operation in serial, which is fine, but lack of consistency means that the system makes <i>no</i> guarantees or representations about the result of an operation <i>whatsoever</i>.</li>
<li>&#8220;Available&#8221; means that you get an answer &#8220;eventually&#8221;, but since eventually can mean any amount of time (a trillion years), there&#8217;s no practical difference between A and not A.</li>
<li>&#8220;Partition-tolerance&#8221; is never actually defined in the paper.</li>
</ol>
<p><b>&#8220;E&#8221; implies &#8220;A&#8221;</b></p>
<p>The reason the acronym doesn&#8217;t need to be &#8220;PACELCA&#8221; is that if there are no partitions, then the system must be available.  Adding an &#8220;A&#8221; to the second part is redundant.  But for me (maybe not for you), putting in the redundant &#8220;A&#8221; in the &#8220;E&#8221; case helps me.  A PA/EL system is always &#8220;available&#8221;, and calling it PA/ELA makes it easier for me to see that availability is always there.</p>
<p><b>How do Availability and Latency relate?</b></p>
<p>Consider what &#8220;highly available&#8221; and &#8220;low latency&#8221; mean.  They are not entirely distinct and orthogonal.  The only useful meaning of &#8220;A&#8221; is that the system replies within a maximum latency.  It could be something like &#8220;response within 10ms at least 90% of the time and within 100ms in any case&#8221; rather than a simple deadline.  We can call this &#8220;fast enough&#8221; to meet the system requirements.  So availability is about latency.</p>
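<p>A requirement of that shape can be checked mechanically against measured reply times.  The sketch below is only an illustration: the function name, the parameter names, and the choice to treat an empty sample as &#8220;not available&#8221; are my assumptions, and the default thresholds are just the 10ms / 90% / 100ms figures from the example above.</p>

```python
def meets_requirement(latencies_ms, fast_ms=10.0, fast_fraction=0.90,
                      hard_limit_ms=100.0):
    """True if enough replies are within fast_ms and none exceeds hard_limit_ms."""
    if not latencies_ms:
        return False  # assumption: no replies at all counts as unavailable
    within_fast = sum(1 for t in latencies_ms if t <= fast_ms)
    all_within_limit = all(t <= hard_limit_ms for t in latencies_ms)
    return all_within_limit and within_fast / len(latencies_ms) >= fast_fraction

# Nine quick replies and one slow-but-acceptable reply: "fast enough".
print(meets_requirement([5.0] * 9 + [50.0]))
# One reply past the hard limit breaks the requirement.
print(meets_requirement([5.0] * 9 + [150.0]))
```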
<p>There is, however, an important practical difference.  &#8220;Available&#8221; is about the system&#8217;s latency relative to the amount of time it takes to repair a partition.</p>
<p>To see this, consider two web sites (with human users) that are based on a system that can have partitions:</p>
<ul>
<li>The operators of the system move so quickly that they always fix partitions within 10ms.  The system is &#8220;available&#8221; even in the face of any single partition, without any special mechanism to be &#8220;partition tolerant&#8221;.</li>
<li>The operators of the system move so slowly that it takes them five minutes to fix a partition.  If the system has no way to be &#8220;partition tolerant&#8221;, it&#8217;s not available.</li>
</ul>
<p>Latency (the &#8220;L&#8221; in PACELC) has nothing to do with repair time, since it only applies when there are no partitions.  A web site is far better with a (maximum/average/whatever) latency of 10ms than with 1000ms.</p>
<p>So &#8220;A&#8221; and &#8220;L&#8221; are different.  But, that said, even if a system meets its &#8220;A&#8221; (fast enough) requirement, it can be valuable to lower the latency below that requirement.  The &#8220;PAC&#8221; characterization does not take this into account.</p>
<p><b>PC/EL is confusing</b></p>
<p>If a system is consistent when there are partitions, then surely it&#8217;s also consistent when there aren&#8217;t any partitions.  If the components work better, the service should not be worse.</p>
<p>At first glance, this seems to mean that &#8220;if PC, then EC&#8221;.  That would mean that PC/EL can&#8217;t describe any realistic system, but Prof. Abadi characterizes PNUTS/Sherpa (as originally presented) as PC/EL.  I&#8217;m sure that there isn&#8217;t really a paradoxical situation with any real system, but rather that there is a way to misinterpret the PACELC notation.  What do PC and EL really mean?</p>
<p>PC means that if a client sends a request when there are partitions that prevent the system from answering promptly and correctly, then the system does not answer, rather than providing an answer that might be incorrect.  Indeed, it might not be able to reply at all, since a total failure is a kind of partition, and there just isn&#8217;t anybody to send back a reply.</p>
<p>EL means that if a client sends a request, and the system can choose between waiting a longer time to send a consistent answer, versus waiting a shorter time to send an inconsistent answer, it chooses (or tilts toward) the latter.</p>
<p><b>Loose ends</b></p>
<ul>
<li>What does &#8220;C&#8221; really mean?  Can&#8217;t we say something better than &#8220;we don&#8217;t guarantee consistency&#8221;?  Dynamo can give you answers that are not definitive but are very useful, with semantics that the application can understand.  What about &#8220;eventual consistency&#8221;?</li>
<li>What about durability?  There&#8217;s a big difference between some data being temporarily offline versus data being lost forever.  Some systems use &#8220;commits over a WAN&#8221; to replace the use of disks, and then the tradeoff of latency versus correctness, from synchronous to asynchronous commits, is important.</li>
<li>Should we distinguish between &#8220;available for read&#8221; vs. &#8220;available for write&#8221;?  This can come up in, e.g., a master-slaves configuration.</li>
</ul>
<p>Stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/improving-the-pacelc-taxonomy/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Errors in Database Systems Still Must Consider Network Partitions</title>
		<link>http://danweinreb.org/blog/errors-in-database-systems-still-must-consider-network-partitions</link>
		<comments>http://danweinreb.org/blog/errors-in-database-systems-still-must-consider-network-partitions#comments</comments>
		<pubDate>Tue, 14 Dec 2010 14:11:46 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=597</guid>
		<description><![CDATA[Prof. Michael Stonebraker wrote a paper, published in the April 2010 issue of CACM, entitled &#8220;Errors in Database Systems: Eventual Consistency and the CAP Theorem&#8221;. As I see it, the overall point of the paper is that the kinds of failures that cause partition-tolerance problems are rare, and not too significant compared to [...]]]></description>
			<content:encoded><![CDATA[<p>Prof. Michael Stonebraker wrote a paper, published in the April 2010 issue of CACM, entitled <a href="http://cacm.acm.org/blogs/blog-cacm/83396-errors-in-database-systems-eventual-consistency-and-the-cap-theorem/fulltext">&#8220;Errors in Database Systems: Eventual Consistency and the CAP Theorem&#8221;.</a> As I see it, the overall point of the paper is that the kinds of failures that cause partition-tolerance problems are rare, and not too significant compared to the other ways that a DBMS can fail.  Therefore, people are worrying too much about <a href="http://danweinreb.org/blog/what-does-the-proof-of-the-cap-theorem-mean">the CAP issue</a>, specifically network partitions.</p>
<p>The paper enumerates eight causes of DBMS failure as seen by an application, such as software errors in the DBMS itself, operating system errors, and so on.  Number six in his list is &#8220;a network partition in a local cluster&#8221;, and number eight is &#8220;a network failure in the WAN connecting clusters together; the WAN failed and clusters can no longer all communicate with each other&#8221;.  (The usual reason one would have multiple clusters connected by a WAN is for &#8220;disaster recovery&#8221;, i.e. dealing with a problem that causes an entire cluster to fail, such as power loss over all hardware on which the cluster depends.)</p>
<p>Number six is the crucial issue in the paper, as far as the CAP issue goes.  He has two answers:</p>
<ol>
<li>In my experience, this is exceedingly rare, especially if one replicates the LAN (as Tandem did).</li>
<li>The overwhelming majority [of local failures] cause a single node to fail, which is a degenerate case of a network partition that is easily survived by lots of algorithms.</li>
</ol>
<p>About #1, I have spent some time talking to the operations architects at ITA Software, which has run high-availability servers for many years now, and heard about their experience.  It depends on what you mean by a &#8220;LAN&#8221;.  If you mean a few computers connected together by an Ethernet, with redundant hardware all around, then the chance of a failure of the network itself is relatively low.  However, real-world data centers with a relatively large number of servers rarely work this way.  The problem is that a real network is very complicated.  It depends on devices at both layer 3 (routers) and layer 2 (switches).  Situations can arise in which pieces of the network are mis-configured by accident; these can be hard to find due, ironically, to the very redundancy that was added to avoid failures.  In particular, there is no way to make any kind of guarantee about the latency within the network, nor the likelihood that a packet will make it from its source to its destination.</p>
<p>About #2, it would have been helpful if there were citations in the paper.  It is hard to reply to such a claim without specifics.  One of the techniques for dealing with one server being down involves &#8220;quorums&#8221;, but they can introduce problems with high availability.</p>
<p>But, more importantly, consider some of the failure modes that <a href="http://www.google.com/search?hl=&amp;q=amazon+dynamo+paper&amp;sourceid=navclient-ff&amp;rlz=1B3GGLL_en___US410&amp;ie=UTF-8&amp;aq=1&amp;oq=amazon+dynamo">Amazon&#8217;s &#8220;Dynamo&#8221; highly-available key-value store</a> is built to deal with.  Suppose we have a key-value pair that resides on two servers, for high availability.  Call them A1 and A2.  An application changes the value of the pair, but does so at a time when A1 is down or unreachable.  So the update is made only to replica A2.  Later, a second application reads the value associated with the key, but this time server A2 is down or unreachable, and server A1 is available.  The second application might not see the new value written by the first application.  I don&#8217;t know of any of the &#8220;lots of algorithms&#8221; that deals with this sort of scenario while providing complete consistency/correctness.</p>
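<p>The scenario can be sketched as a toy simulation in which an update reaches only one replica and a later read is served by the other.  This is not how Dynamo is implemented (it has versioning and reconciliation that this sketch leaves out on purpose); the <code>Replica</code> class and the <code>write</code> and <code>read</code> helpers are hypothetical names I made up for illustration.</p>

```python
class Replica:
    """One copy of the data; 'up' is False while it is down or unreachable."""
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

def write(replicas, key, value):
    """Write to every reachable replica; return the names that took the write."""
    reached = [r for r in replicas if r.up]
    for r in reached:
        r.data[key] = value
    return [r.name for r in reached]

def read(replicas, key):
    """Read from the first reachable replica (no versioning, no read repair)."""
    for r in replicas:
        if r.up:
            return r.data.get(key)
    raise RuntimeError("no replica reachable")

a1, a2 = Replica("A1"), Replica("A2")
write([a1, a2], "k", "old")       # both replicas hold "old"

a1.up = False                     # one replica is unreachable during the update
write([a1, a2], "k", "new")       # only the other replica receives "new"

a1.up, a2.up = True, False        # later, the roles are reversed
stale = read([a1, a2], "k")
print(stale)                      # prints: old
```

<p>The read succeeds and the write succeeded, so the store stayed available throughout; what was given up is that the reader sees the latest write.</p>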
<p>The conclusion of the paper is:  &#8220;In summary, one should not throw out the C so quickly, since there are real error scenarios where CAP does not apply and it seems like a bad tradeoff in many of the other situations.&#8221;  But since network-partition failures really can happen, it&#8217;s not clear that one can simply <em>decide</em> not to throw out the consistency/correctness criterion.</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/errors-in-database-systems-still-must-consider-network-partitions/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>What Is a &#8220;Better&#8221; Database System?</title>
		<link>http://danweinreb.org/blog/what-is-a-better-database-system</link>
		<comments>http://danweinreb.org/blog/what-is-a-better-database-system#comments</comments>
		<pubDate>Sun, 07 Nov 2010 22:49:10 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=567</guid>
		<description><![CDATA[What makes one database management system better than another? Let&#8217;s compare two database systems, A and B, that use the same data model (such as &#8220;relational&#8221;) and the same transaction types (such as &#8220;ACID&#8221;, or ACID with some reduced isolation level. How do we decide whether A is better than B? These days, [...]]]></description>
			<content:encoded><![CDATA[<p>What makes one database management system better than another?</p>
<p>Let&#8217;s compare two database systems, A and B, that use the same data model <a href="http://danweinreb.org/blog/why-relational-databases-anyway">(such as &#8220;relational&#8221;)</a> and the same transaction types (such as &#8220;ACID&#8221;, or <a href="http://danweinreb.org/blog/acid-in-theory-and-practice">ACID with some reduced isolation level</a>).  How do we decide whether A is better than B?</p>
<p>These days, the answer is, which one has better latency or throughput for a given scenario.  A scenario is defined by the contents of the database and the particular queries (any kind of request) it is given.  If you read the marketing literature for any commercial database system, that&#8217;s what it talks about.  The scenarios could include new datatypes, streaming, and so on.  But the metric that measures &#8220;better&#8221; is still speed.</p>
<p>But is this right for all of today&#8217;s needs?</p>
<p>An exciting paper by Daniela Florescu and Donald Kossmann, entitled <a href="http://www.dbis.ethz.ch/research/publications/sigrec08.pdf">Rethinking Cost and Performance in Database Systems</a>, suggests a different way of looking at things: given a set of performance requirements, and consistency needs, what is the least expensive database system we can build that meets those parameters.  (<a href="http://www.sigmod.org/publications/sigmod-record/0903/p43.articles.florescu.pdf">You can also find the paper here.</a>)</p>
<p>For example, sometimes latency (response time) only has to be fast enough that a human being won&#8217;t be bothered by having to wait that much longer.  There&#8217;s no point in trying to get the latency &#8220;as low as possible&#8221; when 100 milliseconds is just fine.</p>
<p>The paper was published in SIGMOD Record, a journal sent to ACM members who are in SIGMOD (the special interest group for Management of Data).  These are typically researchers, or people who follow the research closely.  The paper suggests to the entire database research community that they should consider changing their whole orientation in this way.  If you are working on building a database management system, or interested in looking at some of the new databases and data stores, this new viewpoint is illuminating.  I recommend the paper highly.</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/what-is-a-better-database-system/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>ACID in Theory and Practice</title>
		<link>http://danweinreb.org/blog/acid-in-theory-and-practice</link>
		<comments>http://danweinreb.org/blog/acid-in-theory-and-practice#comments</comments>
		<pubDate>Sat, 06 Nov 2010 03:28:21 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Concurrency]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[ObjectStore]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=562</guid>
		<description><![CDATA[The new so-called NoSQL data stores have been criticized, often by the traditional database community, because they sacrifice &#8220;ACID transactions&#8221;. Is this fair? How much does it matter? I&#8217;ll briefly go over what ACID transactions are and what they&#8217;re for, and then look at how they&#8217;re used, or not. ACID A &#8220;transaction&#8221; works [...]]]></description>
			<content:encoded><![CDATA[<p>The new so-called NoSQL data stores have been criticized, often by the traditional database community, because they sacrifice &#8220;ACID transactions&#8221;.  Is this fair?  How much does it matter?  I&#8217;ll briefly go over what ACID transactions are and what they&#8217;re for, and then look at how they&#8217;re used, or not.</p>
<h3>ACID</h3>
<p>A &#8220;transaction&#8221; works like this: a thread (locus of control) does the following steps:</p>
<ul>
<li>A &#8220;begin transaction&#8221; operation</li>
<li>Arbitrary computation, which can include:
<ul>
<li>examining data</li>
<li>modifying data</li>
</ul>
</li>
<li>Either a &#8220;commit transaction&#8221; operation, or an &#8220;abort transaction&#8221; operation.</li>
</ul>
<p>If the &#8220;commit transaction&#8221; operation completes (i.e. returns to its caller in the thread), the transaction is said to have &#8220;committed&#8221;.  If the thread does an &#8220;abort transaction&#8221;, or if the thread halts (the thread gets an unhandled exception, the thread is killed, the process is killed, the hardware crashes), the transaction is said to have &#8220;aborted&#8221;.</p>
<p>(In some systems, the &#8220;begin transaction&#8221; is implicit when the previous transaction completes; it doesn&#8217;t matter.)</p>
<p>Ideally, a transaction has four properties, usually described with the helpful mnemonic &#8220;ACID&#8221;:</p>
<p><strong>Atomic</strong>: If a transaction modifies the data and the transaction commits, all of the changes are performed; if the transaction aborts, none of them happens.</p>
<p><strong>Consistent</strong>: There is some predicate on the data, called &#8220;consistency&#8221;.  If the data is in a consistent state when the transaction has first started (before it performs any side-effects), then it is consistent after the transaction finishes. (This is trivially true if the transaction aborts.)</p>
<p><strong>Isolated</strong>: Although many threads of control might be examining or modifying the data concurrently (interleaved in time), everything behaves as if they were sequential, i.e. one at a time in some order.</p>
<p><strong>Durable</strong>: If the transaction commits, any modifications it has made are &#8220;durable&#8221;, which means that they take effect even if there is a halt.</p>
<p>&#8220;Consistency&#8221; is hard to define.  What it <em>really</em> means is that the data in the database is an accurate representation of the real world (for example, account X has $A and account Y has $B), and that the transactions that moved the database state from a before-state to an after-state are consistent with real world operations (e.g. money has been withdrawn from account X and deposited in account Y).</p>
<p>Unfortunately, there isn&#8217;t any real way to check and enforce this.  So what happens depends a lot on the application.  Often what people mean by &#8220;consistency&#8221; is that certain invariants are met.  Some database systems provide support for adding checks for these invariants, called &#8220;integrity constraints&#8221;.  Meeting these constraints is necessary but not, in general, sufficient for consistency, but that&#8217;s often what people mean by &#8220;consistency&#8221;.  Mostly people don&#8217;t pay much attention to &#8220;C&#8221;, anyway.</p>
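<p>As a small illustration of an integrity constraint, here is a sketch using Python&#8217;s built-in <code>sqlite3</code> module as a stand-in for a real DBMS.  The <code>account</code> table and the &#8220;balance may not go negative&#8221; invariant are assumptions made up for the example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # the invariant
)
conn.execute("INSERT INTO account VALUES ('X', 100)")
conn.commit()

try:
    # This would drive the balance to -50, so the database rejects it.
    conn.execute("UPDATE account SET balance = balance - 150 WHERE name = 'X'")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

balance = conn.execute(
    "SELECT balance FROM account WHERE name = 'X'").fetchone()[0]
print(rejected, balance)  # prints: True 100
```

<p>The check catches a negative balance, but it has no way of noticing that money was credited to the wrong account, which is why such constraints are necessary but not sufficient.</p>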
<p>If a data storage system is both Atomic and Durable together, then modifications made by a committed transaction are <em>all</em> performed on the database, even in the face of a halting failure.  This plus Isolation presents the application with an abstraction that&#8217;s very clean and easy to deal with.</p>
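<p>Here is a minimal sketch of the atomic part, again using Python&#8217;s built-in <code>sqlite3</code> module as a stand-in.  The table, the account names, and the <code>fail_midway</code> flag (which simulates a halt between the debit and the credit) are hypothetical.  The database is in memory, so this shows atomicity only, not durability.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("X", 100), ("Y", 0)])
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move money in one transaction; an exception aborts the whole thing."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
            if fail_midway:
                raise RuntimeError("simulated halt between debit and credit")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except RuntimeError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))

first_ok = transfer(conn, "X", "Y", 30, fail_midway=True)
after_abort = balances(conn)      # the debit was undone: X=100, Y=0
second_ok = transfer(conn, "X", "Y", 30)
after_commit = balances(conn)     # both changes applied: X=70, Y=30
print(first_ok, after_abort, second_ok, after_commit)
```

<p>The application code never has to clean up a half-finished transfer; that is the separation of concerns at work.</p>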
<p>Most important, ACID is entirely independent of the application.  The concerns of the application are entirely separated from the concerns of failure and interleaving.  This separation of concerns makes things much simpler, and reducing complexity is of great value.</p>
<h3>Isolation in Theory and Practice</h3>
<p>Writing an application is easy with isolation, because the programmer can ignore concurrency.  But do people really use database systems this way?  When we look around, we find the concept of an &#8220;isolation level&#8221;, in which an application can decide how much isolation it wants.  Don&#8217;t they all want total isolation?  Yes, but there&#8217;s a big problem: total isolation hurts performance severely in so many cases that it&#8217;s rarely used!  If you don&#8217;t believe me, consider the following.</p>
<p>Thomas Kyte has written widely about Oracle DB, especially about how its transactions work.  His book, &#8220;Expert Oracle Database Architecture&#8221;, was recommended to me by a skilled Oracle database administrator; Kyte is highly respected.  Although Oracle DB can do ACID transactions, the book strongly recommends against using them.  Oracle DB has more than one &#8220;isolation level&#8221;.  The strongest, READ REPEATABLE, provides ACID transactions.  (Almost. If you care about the &#8220;phantom read&#8221; issue, you don&#8217;t need me to tell you about this stuff.)  Instead, he recommends that you use the READ COMMITTED isolation level.  He says that it is &#8220;the most commonly used isolation level&#8221; and that &#8220;it is rare to see a different isolation level used.&#8221;</p>
<p>When using READ COMMITTED, you are not guaranteed to get &#8220;repeatable reads&#8221;.  That is, during the course of a transaction, you might read a value, and later read the same value and get a different result back, because of writes by concurrent transactions!  Remember that a read is often not a direct request to read just one column of one row; it&#8217;s often part of a more general SQL query.  You might not even know that there is some data that two SQL queries both read.  This is not what I&#8217;d call the &#8220;I&#8221; in &#8220;ACID&#8221;.  The concurrency, rather than being cleanly separated from the application, is now exposed to the application.  The application writer has to know that reads are not repeatable and take that into account, which makes his or her life harder.</p>
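<p>To see what a non-repeatable read looks like, here is a toy model in which every read returns whatever was most recently committed, contrasted with a transaction that reads from a snapshot taken when it began.  Both classes are hypothetical simplifications of mine; neither is how Oracle or any other DBMS implements its isolation levels.</p>

```python
class ReadCommittedStore:
    """Toy store: every read returns the latest committed value."""
    def __init__(self):
        self.committed = {}

    def commit_write(self, key, value):
        self.committed[key] = value

    def read(self, key):
        return self.committed[key]

class SnapshotTransaction:
    """Contrast: reads come from a copy taken when the transaction began."""
    def __init__(self, store):
        self.snapshot = dict(store.committed)

    def read(self, key):
        return self.snapshot[key]

store = ReadCommittedStore()
store.commit_write("seats_left", 5)

snap = SnapshotTransaction(store)        # a transaction with repeatable reads
first_read = store.read("seats_left")    # T1 reads the item...
store.commit_write("seats_left", 4)      # ...T2 commits a write in between...
second_read = store.read("seats_left")   # ...and T1 reads the item again.

print(first_read, second_read)           # prints: 5 4
print(snap.read("seats_left"))           # prints: 5
```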
<p>This isn&#8217;t specific to Oracle.  In fact, it&#8217;s so pervasive in relational databases that it&#8217;s even part of the SQL standard.  Isolation levels are so important that they aren&#8217;t just an implementation-specific hack.  The official SQL standard defines several reduced levels of isolation.</p>
<p>Here&#8217;s another story about not using ACID.  I and the rest of the ObjectStore team at Object Design once had the great opportunity to talk with some of the most renowned database experts in the world, at IBM&#8217;s Almaden Research Center.  These are the people who designed one of the earliest relational database systems (System/R), and continued to do groundbreaking work, which you can read about in the many excellent papers they have published, many of which I had read.  The group included Don Haderle, C. Mohan, Bruce Lindsay, and others.  If these people don&#8217;t know about transactions, nobody does.</p>
<p>When they heard us say that ObjectStore provided real ACID transactions, they were surprised, and explained to us that nobody really uses those.  They said you mustn&#8217;t do that, or your database system will be too slow.</p>
<p>They said, what our relational database applications use is &#8220;cursor-stability isolation&#8221;.  Here&#8217;s how that one works.  In a relational database, you typically perform a query, and get back a sequence of rows (a.k.a. tuples).  The application iterates over the tuples, with a &#8220;cursor&#8221; to keep track of where in the sequence it&#8217;s up to.  With &#8220;cursor-stable&#8221; isolation, when the cursor moves to a row, that row is locked.  When it moves the cursor to the next row, the old row is unlocked and the next row is locked.  At the end, the last row is unlocked and the transaction ends.</p>
<p>I was very surprised.  While the application is working with one row, all the other rows could change out from under it.  If you were trying to sum up some column (attribute) of each row, you might not get a consistent snapshot of the database.</p>
<p>For example, suppose each row represented a bank account, and an application A wants to transfer $100 from account A to account B.  Concurrently, application B wants to sum up the total amount in all accounts.  B should get the same answer no matter how it is interleaved with A.  B should not see an inconsistent state where the debit has been done and the credit has not.  I asked how an application could deal with such confusing behavior.  This is a lot like the Oracle situation: reads are not repeatable.</p>
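<p>The problem can be sketched in a few lines.  In this toy model the summing application reads one row at a time and holds nothing once it has moved on, and a complete transfer commits between its two reads.  The function names and the starting balances are made up for the illustration; real cursor-stability locking is more involved than this.</p>

```python
def sum_row_by_row(accounts, order, between_rows=None):
    """Sum balances one row at a time, as a cursor would.

    between_rows, if given, is called after the first row has been read and
    released; it stands in for a concurrent transaction that commits then.
    """
    total = 0
    for i, name in enumerate(order):
        total += accounts[name]       # the row is "locked" only while read
        if i == 0 and between_rows is not None:
            between_rows()
    return total

def make_transfer(accounts, src, dst, amount):
    def transfer():
        accounts[src] -= amount
        accounts[dst] += amount
    return transfer

accounts = {"A": 500, "B": 300}
true_total = sum(accounts.values())                   # 800 before and after
transfer = make_transfer(accounts, "A", "B", 100)
seen_total = sum_row_by_row(accounts, ["A", "B"], transfer)
print(true_total, seen_total)  # prints: 800 900
```

<p>The transfer itself was applied all at once, yet the summing application counted the $100 twice: once in account A before the transfer, and once in account B after it.</p>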
<p>I was very surprised: how can the application writers be expected to deal with this lack of isolation?  The answer went something like this:</p>
<p>Summing up a column was really done in one SQL transaction using a SUM aggregate, and in that case the problem does not arise, because within a single SQL query, you do get isolated behavior.  (This is true in Oracle as well.)  Many common simple cases can be handled using SQL aggregation operators.</p>
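<p>To make the contrast concrete, here is a small sketch of the two ways of summing a column.  It uses Python&#8217;s built-in <code>sqlite3</code> module only because that makes the example self-contained; the <code>account</code> table and its contents are invented for illustration, and SQLite&#8217;s behavior says nothing about how DB2 or Oracle implement isolation.</p>

```python
import sqlite3

# In-memory SQLite database; table name and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [("A", 500), ("B", 300), ("C", 200)],
)
conn.commit()

# Imperative style: the application walks the rows and adds them up itself.
total_imperative = 0
for (balance,) in conn.execute("SELECT balance FROM account"):
    total_imperative += balance

# Declarative style: one query, and the database computes the aggregate.
(total_declarative,) = conn.execute(
    "SELECT SUM(balance) FROM account"
).fetchone()

print(total_imperative, total_declarative)
```

<p>With no concurrent writers the two totals agree.  The point of the single-query form is that the database system, rather than the application, is then responsible for giving the whole computation one consistent view of the data.</p>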
<p>Yes, it&#8217;s true that if you have more than one query in your transaction,  the application programmer does have to be aware of possible effects of interleaving.  However, in real life (they said), most transactions are simple enough that it&#8217;s not so hard to reason about the effects of reduced isolation, and sometimes you can just ignore them.</p>
<p>To me, this was not a very satisfying answer.  It&#8217;s like saying, well, it works in simple cases and when you&#8217;re lucky.</p>
<p>In ObjectStore, there were data structures much more complicated than tables.  Indeed, ObjectStore could store anything that you can express in your programming languages (C++ or Java).  We didn&#8217;t see any way to do something analogous.  We got away with using ACID because the sweet spot for ObjectStore wasn&#8217;t applications doing fine-grained interleaving.</p>
<h3>Who Casts the First Stone?</h3>
<p>The ACID transaction abstraction provides an excellent separation of concerns.   It&#8217;s true that the NoSQL stores, with their &#8220;eventual consistency&#8221; properties, or their &#8220;return many possibly-different values&#8221; API&#8217;s, force the application to live with weaker guarantees than ACID.  But so do the real relational database systems.  Academic papers or commercial white papers that criticize the NoSQL data stores for not providing ACID should be fair: in the real world, nobody who cares about fine-grained concurrency is providing ACID guarantees.</p>
<h3>Addendum of Nov 8, 2010</h3>
<p>One of ITA&#8217;s very knowledgeable Oracle experts pointed out some issues that I should have discussed.</p>
<p>I should have mentioned that using Oracle&#8217;s &#8220;read committed&#8221; isolation, you <i>do</i> get repeatable reads <i>within a single SQL query</i>.  When writing software that uses relational databases, it&#8217;s good to do as much as you can within a single query, rather than doing many queries as part of an imperative flow of control.  All other things being equal, declarative code is better than imperative code.  It is much easier for a person to reason about, which makes code clearer and easier to understand.  Also, it makes code easier for a computer to understand.  Writing an optimizer for imperative code is harder than writing one for declarative code.</p>
<p>Our expert tells me that programmers, who are generally trained in, and experienced with, imperative coding, will sometimes write programs that do one query after another, when the work could have been done in a single query.  To be sure, doing it in a single query can require you to learn more about SQL.  But if you&#8217;re using a relational database that uses SQL, you really ought to learn that stuff.  If you are using a tool, you should learn to use it.</p>
<p>Of course, not all situations allow you to take a transaction and make it only need one SQL statement.  But if you <i>can</i> do that, you get transaction guarantees that are much closer to ACID.  (It&#8217;s still not precisely ACID due to the so-called &#8220;phantom&#8221; scenario, but I will cut Oracle slack for that since it&#8217;s hard to solve in their architecture.)</p>
<p>However, I&#8217;ll add that one of the criticisms of the &#8220;NoSQL&#8221; data stores that the relational experts make is that they can only do one operation in a query.  While that is true, and it is a disadvantage, it&#8217;s also true that if you use Oracle, your transaction has better properties if it only performs one operation (query) per transaction.  That&#8217;s not an apples-to-apples comparison (traditional RDBMS&#8217;s are <i>capable</i> of doing multi-query transactions), but it&#8217;s something to think about.</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/acid-in-theory-and-practice/feed</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>OpenSQL Boston 2010 Takes Place This Weekend</title>
		<link>http://danweinreb.org/blog/opensql-boston-2010-takes-place-this-weekend</link>
		<comments>http://danweinreb.org/blog/opensql-boston-2010-takes-place-this-weekend#comments</comments>
		<pubDate>Wed, 13 Oct 2010 10:47:58 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Boston]]></category>
		<category><![CDATA[Conference]]></category>
		<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=559</guid>
		<description><![CDATA[OpenSQL Camp Boston happens this weekend. It&#8217;s an unConference, which means anybody can give a talk and anybody can listen. There are usually several parallel tracks. This is an unConference about open source databases, both relational and non-relational databases, database alternatives like &#8220;NoSQL stores&#8221;, and so on. There will be people from PostgreSQL, [...]]]></description>
			<content:encoded><![CDATA[
		<p><a href="http://opensqlcamp.org/Events/Boston2010/">OpenSQL Camp Boston</a> happens this weekend.  It&#8217;s an unConference, which means anybody can give a talk and anybody can listen.  There are usually several parallel tracks.  This is an unConference about open source databases, both relational and non-relational databases, database alternatives like &#8220;NoSQL stores&#8221;, and so on.  There will be people from PostgreSQL, MySQL, MariaDB, VoltDB, Rackspace, InfoBright, BerkeleyDB, MIT, and others.</p>
<p>The events are:</p>
<ul>
<li>Friday Oct 15, at 6pm: social event at WorkBar Boston, 711 Atlantic Ave, Boston, MA</li>
<li>Saturday, Oct 16: unConference at the Stata Center</li>
<li>Sunday, Oct 17: more unConference at the Stata Center, ending 6:00 p.m.</li>
</ul>
<p><a href="http://opensqlcamp.org/Events/Boston2010/">Click here for the full info.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/opensql-boston-2010-takes-place-this-weekend/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>More About Data Models</title>
		<link>http://danweinreb.org/blog/more-about-data-models</link>
		<comments>http://danweinreb.org/blog/more-about-data-models#comments</comments>
		<pubDate>Fri, 08 Oct 2010 22:27:46 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=549</guid>
		<description><![CDATA[Reading the comments on my earlier posts, as well as other posts in other blogs, shows me that there is still some confusion about the relational model. I&#8217;d like to clear some of this up. I&#8217;m going to stop talking about the &#8220;network&#8221; or &#8220;CODASYL&#8221; model. I don&#8217;t know its details and history. [...]]]></description>
			<content:encoded><![CDATA[
		<p>Reading the comments on my earlier posts, as well as other posts in other blogs, shows me that there is still some confusion about the relational model.  I&#8217;d like to clear some of this up.</p>
<p>I&#8217;m going to stop talking about the &#8220;network&#8221; or &#8220;CODASYL&#8221; model.  I don&#8217;t know its details and history.  I&#8217;ll just compare the relational model with the hierarchical model, as seen in, e.g. IBM&#8217;s IMS.</p>
<h3>Object-Oriented Database Systems</h3>
<p>It&#8217;s ironic that I&#8217;m being looked at as some kind of &#8220;relational fanboy&#8221;.  The reason that I first learned about database systems was to build an object-oriented database system (for and in Common Lisp) at Symbolics.  It was called Statice and the first version was completed in 1988.</p>
<p>After that, <a href="http://danweinreb.org/blog/why-did-symbolics-fail">Symbolics</a> refused to let us port Statice to other hardware.  Meanwhile, the object-oriented database system idea was picking up a lot of &#8220;traction&#8221;.  A group of seven of us formed <a href="http://danweinreb.org/blog/the-technology-and-business-of-objectstore">Object Design</a>, where we built the best and most successful (if I may say so) object-oriented database system (this time in C++, later in Java).</p>
<p>I am not a relational model <i>fanboy</i> in the least!  The point of my recent blog postings is to explain why Object Design kept getting pelted with rotten tomatoes by the <i>real</i> relational database fanboys.  I call them &#8220;fanboys&#8221; because their devotion to the relational model was often blind.  Either they did not understand what problem the relational model was intended to solve, or they didn&#8217;t understand why that wasn&#8217;t the problem we were addressing.</p>
<p>Despite the fact that I&#8217;m an object-oriented database guy through and through, I can still understand the purpose and benefits of the relational model.  Because these issues are arising in the &#8220;NoSQL&#8221; world, I want to try to lay out an explanation, especially to show what a data model is, and how it&#8217;s different from transaction and scalability issues.</p>
<h3>What a Data Model Is</h3>
<p>A data <i>model</i> is a way to model some facts about the real world that you care about.  The best book I know of about modelling data is &#8220;Data and Reality&#8221;, by William Kent, then of Hewlett-Packard.  It was written in 1978 but it&#8217;s so fundamental that nothing has changed.  Quite a lot of real-world facts can be modeled with normalized relational models.  People can debate over whether such a representation is natural, unnatural, easy or hard to think about.  If you want to make a whole new data model, read Kent&#8217;s book before you get too far.</p>
<p>The model you choose to use says absolutely nothing <i>per se</i> about indexes, how much time certain operations take, transactions, eventual consistency, or cloud computing.  Implementing a database system based on a certain data model is another thing entirely, which I&#8217;ll be discussing in later posts.</p>
<h3>Schema Evolution</h3>
<p>Does the relational model, in and of itself, say anything about taking existing data, and <i>changing</i> the relational schema so that the facts in the original database are faithfully represented in the new database?  I&#8217;m not 100% sure, but I don&#8217;t think so.  I think the relational model does not talk about schema evolution, just about how to represent things as of now.</p>
<p>(Believe me, I know about schema evolution in relational database systems.  I know a lot more than I want to know.  My co-workers have been through a lot of pain. As my son says, &#8220;Sad face!&#8221;  To be fair, it wasn&#8217;t easy in ObjectStore either: we did it but it was hard.)</p>
<p>Saying that a schema is &#8220;sparse&#8221; can mean one of two things.  First, it can mean that the model is relational but, as an implementation convenience, we don&#8217;t store certain things that we don&#8217;t need in certain places, because the columns and rows can be defaulted, or compressed out.  In that sense, it&#8217;s just an implementation technique that has nothing to do with the model.</p>
<p>Second, it can mean that there is a flexible schema.  For example, an individual document in MongoDB can have a new attribute name that was never used anywhere else.  <i>That</i> is not relational.</p>
<h3>C. J. Date&#8217;s Book Title</h3>
<p>In the previous blog posts, I did not mean to imply that C. J. Date thinks that all database systems use the relational model.  All I was trying to do was point out, as a sort of footnote, that his book does not have the word &#8220;Relational&#8221; in its title.  He just takes it for granted.  In my earlier post, <a href="http://danweinreb.org/blog/why-relational-databases-anyway">Why Relational Databases, Anyway?</a>, my point was that many people take for granted that the relational model is always right for every single data storage need, and that anything that is not relational (such as object-oriented database systems) is inherently misguided.  That is not the case.</p>
<h3>Hierarchical versus Relational: Anticipating New Queries</h3>
<p>Consider the standard example of Company-Division-Employee, where each company has many divisions and each division has many employees.  In a normalized relational representation, there is a table called &#8220;Company&#8221; with one row for each company being modeled.  There is a &#8220;Division&#8221; table with one row per division being modeled, and a column that is a foreign key to the row of Company that represents this division&#8217;s company.  The next level, Employee, is analogous.</p>
<p>The <i>data model</i> is <i>not hierarchical</i>.  You could say that the <i>particular schema</i>, when looked at in one particular way, acts hierarchically.  But the schema I&#8217;m describing, in and of itself, viewed purely from the relational model (ignoring implementation details) can model things that don&#8217;t look hierarchical whatsoever.</p>
<p>In the old-style hierarchical model, in order to say &#8220;add up the salaries of every employee&#8221;, your query has to start by saying &#8220;for all the companies&#8230;&#8221;.  In the relational model, you don&#8217;t do that.  It&#8217;s much simpler.  You just go to the Employee table, and add up the values of the Salary column.  You can <i>ignore</i> that hierarchy entirely!  That&#8217;s the key benefit of the relational model.</p>
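<p>Here is a rough sketch of that difference, using Python&#8217;s built-in <code>sqlite3</code> module.  The table layout follows the Company-Division-Employee example above, but the column names and sample rows are my own assumptions, and the nested loops only imitate the <i>shape</i> of a hierarchical traversal; they are not how IMS is actually programmed.</p>

```python
import sqlite3

# Illustrative schema and data; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE division (id INTEGER PRIMARY KEY, name TEXT,
                       company_id INTEGER REFERENCES company(id));
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER,
                       division_id INTEGER REFERENCES division(id));
INSERT INTO company  VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO division VALUES (10, 'Sales', 1), (11, 'Research', 1), (20, 'Ops', 2);
INSERT INTO employee VALUES (100, 'Ann', 50, 10), (101, 'Bob', 60, 11),
                            (102, 'Cy', 70, 20);
""")

# Hierarchical shape: start at the top and walk down level by level.
total_by_walking = 0
for (company_id,) in conn.execute("SELECT id FROM company").fetchall():
    divisions = conn.execute(
        "SELECT id FROM division WHERE company_id = ?", (company_id,)
    ).fetchall()
    for (division_id,) in divisions:
        salaries = conn.execute(
            "SELECT salary FROM employee WHERE division_id = ?", (division_id,)
        ).fetchall()
        for (salary,) in salaries:
            total_by_walking += salary

# Relational shape: go straight to the Employee table and ignore the hierarchy.
(total_direct,) = conn.execute("SELECT SUM(salary) FROM employee").fetchone()

print(total_by_walking, total_direct)
```

<p>Both approaches give the same total here.  The second one never mentions companies or divisions at all, which is the point being made.</p>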
<h3>Data Independence, Again</h3>
<p>This is what I was saying in the previous blog post: <a href="http://danweinreb.org/blog/the-problem-that-relational-databases-solve">The Problem That Relational Databases Solve</a>.  In this example, if we suddenly find that we need to get the sum of all the employee salaries, the data model makes that easy.  The relational model, with normalization, is doing its best to <i>anticipate</i> queries that might come with a new application, and make them easy and natural.  (Remember, the mind set is that applications come and go, but data stays for a long time.)  That is what data independence is for.</p>
<p>If you or I had to make a data model with these properties, and we had never seen the relational model before, would we have come up with the model, just as E. F. Codd did?  I don&#8217;t know about you, but I doubt I would have.  I think it&#8217;s a novel and brilliant idea, to solve the <i>particular problem</i> that it was designed to solve.</p>
<h3>Does NoSQL Mean We Don&#8217;t Like the Relational Model?</h3>
<p>The fact that the &#8220;NoSQL&#8221; database systems are flourishing does <i>not, in and of itself</i>, mean that people are intentionally turning away from the relational model.  After all, if someone primarily wanted a different model, they could just build a conventional centralized DBMS that uses that model.  This would have nothing to do with scaling and transactions.  ObjectStore and many other object-oriented databases did that, and those are not the only examples.</p>
<p>It&#8217;s one thing to say that giving up the relational model is a cost that is more than balanced by the benefits of a new type of data store.  It&#8217;s quite another to say that the relational model itself is undesirable in all cases.  One could feel either way, but I think it&#8217;s helpful to distinguish these two issues.</p>
<h3>Coming Soon</h3>
<p>What about using the actual DBMS products that you can buy?  What are the consequences of denormalizing, of using BLOBs/CLOBs, and so on?</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/more-about-data-models/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Problem That Relational Databases Solve</title>
		<link>http://danweinreb.org/blog/the-problem-that-relational-databases-solve</link>
		<comments>http://danweinreb.org/blog/the-problem-that-relational-databases-solve#comments</comments>
		<pubDate>Mon, 04 Oct 2010 14:53:28 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=541</guid>
		<description><![CDATA[As I said last time, &#8220;data independence&#8221; is a clean separation between applications and data. What problem does it solve, and how does it solve it? In my previous post, I talked about the people who take the relational model for granted. Where did it first come from, and why? (Most of this [...]]]></description>
			<content:encoded><![CDATA[
		<p>As I said last time, &#8220;data independence&#8221; is a clean separation between applications and data.  What problem does it solve, and how does it solve it?</p>
<p>In my <a href="http://danweinreb.org/blog/why-relational-databases-anyway">previous post</a>, I talked about the people who take the relational model for granted.  Where did it first come from, and why?</p>
<p>(Most of this essay is taken or paraphrased from perhaps the best expositor of the relational concept, C. J. Date.  I am grateful to Prof. Michael Stonebraker for his comments on an earlier draft.  As always, any errors are mine alone.)</p>
<h3>Data Independence</h3>
<p>For a large enterprise, there is a very large body of crucial information.  These are the &#8220;crown jewels&#8221; of the information technology part of the company.  This information lasts for the whole lifetime of the enterprise.  But applications come and go, like migrating birds.  The next application to come along might want to access data in a different way, for important reasons.  The structure of the database must adapt well to these new and changing demands.</p>
<p>With the older styles of data organization (called &#8220;network&#8221; or &#8220;CODASYL&#8221;, roughly speaking), sometimes the new application could not be done efficiently.  Many times, for all practical purposes, it was impossible to write the application with acceptable performance.  You can find the details of this in many books, but to give just one analogy: suppose you have a program with nested loops.  In many cases (not 2D arrays), it&#8217;s pretty obvious which loop ought to be on the outside.  Well, imagine if you were forced to do it the other way, even if it made the program very much slower.  And that&#8217;s just one example.</p>
<p>To solve this, we want data organization that can do two things.  First, give every application a view of the database that doesn&#8217;t change over time, so that the application keeps working.  Second, have a way to change the physical organization of the data without changing any of the software that uses the database system.  Such a change may be needed to make the new applications faster without hurting the old ones, or at least without hurting them enough to matter.  This is called &#8220;data independence.&#8221;</p>
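<p>As a rough illustration of those two requirements, the following sketch uses a SQL view to give an &#8220;old application&#8221; a stable picture of the data while the storage underneath is reorganized.  It uses Python&#8217;s built-in <code>sqlite3</code> module; every table, view, index and function name here is hypothetical, and a view is only one simple mechanism for this, not the whole of what data independence means.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original storage, plus a view that applications are told to use.
conn.executescript("""
CREATE TABLE person_v1 (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
INSERT INTO person_v1 VALUES (1, 'Ann', 'Boston'), (2, 'Bob', 'Albany');
CREATE VIEW person AS SELECT id, name, city FROM person_v1;
""")


def names_in(city):
    """The 'old application': it only knows about the view called person."""
    rows = conn.execute(
        "SELECT name FROM person WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in rows]


before = names_in("Boston")

# Reorganize the storage: split the table in two, add an index,
# and redefine the view so that it presents the same columns as before.
conn.executescript("""
DROP VIEW person;
CREATE TABLE person_core (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person_addr (person_id INTEGER PRIMARY KEY, city TEXT);
INSERT INTO person_core SELECT id, name FROM person_v1;
INSERT INTO person_addr SELECT id, city FROM person_v1;
DROP TABLE person_v1;
CREATE INDEX person_addr_city ON person_addr (city);
CREATE VIEW person AS
    SELECT c.id AS id, c.name AS name, a.city AS city
    FROM person_core c JOIN person_addr a ON a.person_id = c.id;
""")

after = names_in("Boston")
print(before, after)
```

<p>The function <code>names_in</code> is not touched by the reorganization and returns the same answer before and after it, which is the property the two requirements are asking for.</p>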
<h3>The Relational Model</h3>
<p>A novel and effective solution to data independence, the &#8220;relational model&#8221;, was created by E. F. Codd, in 1970.  By representing data in relations, in normalized form, you can solve both of the above problems.  I won&#8217;t go over all that here; I recommend &#8220;An Introduction to Database Systems&#8221; by C. J. Date.</p>
<p>(By the way, notice that the name of the book isn&#8217;t &#8220;&#8230; to Relational Database Systems,&#8221; even though that&#8217;s what the book is.  Why bother with a superfluous adjective, when &#8220;everybody knows&#8221; that all database systems, other than ancient ones, are relational?)</p>
<p>The relational model, as an abstract concept, is an excellent and brilliant solution to the data independence problem.  Later we&#8217;ll see that that is not the only problem for which people want to store data.  But in the next post, I&#8217;ll look into how well actual relational database systems implement the concept.</p>
<p>(Postscript: I am only talking here about the way data is modeled.  I&#8217;ll talk about transaction issues later.)</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/the-problem-that-relational-databases-solve/feed</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Why Relational Databases, Anyway?</title>
		<link>http://danweinreb.org/blog/why-relational-databases-anyway</link>
		<comments>http://danweinreb.org/blog/why-relational-databases-anyway#comments</comments>
		<pubDate>Wed, 22 Sep 2010 10:29:16 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=537</guid>
		<description><![CDATA[Calling a data store &#8220;NoSQL&#8221; is not quite right. Insofar as they mean &#8220;no SQL can be found here&#8221;, what they are really saying is that it&#8217;s not a relational database. (More properly, a relational database management system.) SQL, per se, isn&#8217;t the only relational query language there ever was. It&#8217;s just the [...]]]></description>
			<content:encoded><![CDATA[
		<p>Calling a data store &#8220;NoSQL&#8221; is not quite right.  Insofar as they mean &#8220;no SQL can be found here&#8221;, what they are really saying is that it&#8217;s not a <em>relational database</em>.  (More properly, a relational database management system.)  SQL, per se, isn&#8217;t the only relational query language there ever was.  It&#8217;s just the one that caught on and &#8220;got traction&#8221;.</p>
<p>For example, one of the early relational DBMS&#8217;s was called Ingres, done by a team at U.C. Berkeley led by Michael Stonebraker.  It had a query language called QUEL, which was a competitor of SQL (then called SEQUEL).  That could have been the query language used for relational database systems today.</p>
<p>If memory serves, QUEL was cleaner than SQL.  QUEL was more directly modeled on the relational calculus, whereas SQL is sort of a hybrid of the relational algebra and the relational calculus, which is quite different in structure.  This is not exactly my area of expertise, so I&#8217;ll stop here, but the point is that, in terms of their data models, &#8220;NoSQL&#8221; systems are really <em>no relational model</em> database systems.</p>
<p>There&#8217;s a famous saying: &#8220;fish are not aware of water&#8221;.  A whole generation of software engineers swam in the water of relational database systems, having seen nothing else, with no basis for comparison. To say &#8220;database system&#8221; meant &#8220;relational database system&#8221;.</p>
<p>They were so widely accepted that anything else was viewed with suspicion if not outright contempt.  Back at Object Design, we made an object-oriented database system called ObjectStore.  We took a lot of criticism in news groups and such from relational model fans, but they often expressed themselves poorly, e.g. &#8220;well, everybody knows that relational is good and anything else is bad&#8221;.  It&#8217;s rather hard to answer the criticism in that form.  What was really going on here?</p>
<p>As with most things in the software world, the relational model is a particular tool to solve a particular problem.  The relational model solved a problem that was very important in its time, and is still important now in some contexts.  The problem is to create a clean layering between applications and database systems, which is called &#8220;Data Independence.&#8221;  More about this in a future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/why-relational-databases-anyway/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>NoSQL Storage Systems Never Violate ACID.  Never?  Well, Hardly Ever!</title>
		<link>http://danweinreb.org/blog/nosql-storage-systems-never-violate-acid-never-well-hardly-ever</link>
		<comments>http://danweinreb.org/blog/nosql-storage-systems-never-violate-acid-never-well-hardly-ever#comments</comments>
		<pubDate>Tue, 07 Sep 2010 11:31:55 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Concurrency]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[High Availabilty]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Storage]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=507</guid>
		<description><![CDATA[Everybody agrees that the new &#8220;NoSQL&#8221; storage systems &#8220;aren&#8217;t ACID&#8221;, or &#8220;don&#8217;t have transactions&#8221;.  This is true in a sense, but without knowing the sense, it doesn&#8217;t tell you much. In one sense, they do have transactions that are limited to having one operation per transaction.  One operation could mean reading, writing, incrementing, [...]]]></description>
			<content:encoded><![CDATA[
		<p>Everybody agrees that the new &#8220;NoSQL&#8221; storage systems &#8220;aren&#8217;t ACID&#8221;, or &#8220;don&#8217;t have transactions&#8221;.  This is true <i>in a sense</i>, but without knowing the sense, it doesn&#8217;t tell you much.</p>
<p>In one sense, they <i>do</i> have transactions that are limited to having one operation per transaction.  One operation could mean reading, writing, incrementing, or doubling the value associated with a particular key.  For example, look at an &#8220;insert&#8221; operation in a key/value store.  An operation acts on only one data object.  Are these single-operation transactions ACID?  Let&#8217;s check each criterion:</p>
<p>A means &#8220;atomic&#8221;: either all the operations happen, or none of them happens.  Well, there&#8217;s only one operation.  The key-value store <i>does</i> guarantee that either the insert happens, or it doesn&#8217;t.  So the transaction is atomic.</p>
<p>C means &#8220;consistent&#8221;.  In relational database systems, people use this to mean that various interesting consistency guarantees are maintained.  But here, we don&#8217;t have to worry about such things as referential integrity, since there are no references to have integrity; that is, there are no foreign keys.  So it&#8217;s consistent.</p>
<p>I means &#8220;isolated&#8221;: concurrency is never seen by the application.  The system behaves as if each operation happened at a particular, distinct moment in time.  The key-value stores all make this guarantee.</p>
<p>D means &#8220;durable&#8221;: before the application is told that the transaction has completed successfully (i.e. committed), any side-effects it performed are in stable storage, so that a failure (such as a crash of a process or of a whole node) won&#8217;t lose the results of those side-effects.  Here, a transaction is only one operation, but that doesn&#8217;t change anything: the system does provide &#8220;durability&#8221;.  (Some systems might cheat by not actually forcing data to stable storage, but we&#8217;re not talking about those.)</p>
<p>So it appears to be ACID!  OK, something has <i>got</i> to be wrong here, right?</p>
<p>Right.  Where I tried to pull the wool over your eyes is the definition of &#8220;C&#8221;.  &#8220;C&#8221; doesn&#8217;t just mean conforming to the database&#8217;s integrity constraints.  It means that the system returns the correct answer! That is, the response to any operation is consistent with some state that the database could be in.  There&#8217;s more than one such state when there are concurrent operations going on, which might be ordered in more than one way, depending on how the concurrency system works.  So it&#8217;s clearer to think of &#8220;C&#8221; as meaning &#8220;correct&#8221;.  (In the <a href="http://danweinreb.org/blog/what-does-the-proof-of-the-cap-theorem-mean">famous Gilbert and Lynch paper that &#8220;proves the CAP theorem&#8221;</a>, that&#8217;s what they mean by &#8220;C&#8221;.)</p>
<p>The &#8220;NoSQL&#8221; storage systems are guaranteed to return the correct answer <i>only</i> if there are no partitions in the network.  But if there are (or were, e.g. at write time) partitions, they can return things like &#8220;two replicas say the value is X, but another replica says that the answer is Y&#8221;, and the application has to try to make sense of and cope with that.  That is <i>not</i> &#8220;C&#8221;.  This is usually called &#8220;eventual consistency&#8221;: if the partitions were to eventually heal, and the system deferred accepting new operations until all the in-progress operations finished, and something went over the whole database to fix up any inconsistencies that happened during writes, then the system would become fully consistent, and would behave correctly until the next partition.</p>
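<p>To make the &#8220;two replicas say X, another says Y&#8221; situation concrete, here is a minimal sketch in Python.  It is not how any particular NoSQL system works; the function name <code>read_from_replicas</code> and the use of plain dictionaries as replicas are assumptions made up for illustration.</p>

```python
from collections import Counter

def read_from_replicas(replicas, key):
    """Ask every reachable replica for `key` and report what they said.

    `replicas` is a list of dicts standing in for storage nodes (an
    assumption for this sketch); None stands in for an unreachable node.
    """
    answers = [r[key] for r in replicas if r is not None and key in r]
    votes = Counter(answers)
    if len(votes) == 1:
        # All reachable replicas agree, so there is a single answer.
        return {"agreed": True, "values": list(votes)}
    # Replicas disagree (or none answered): hand every candidate back,
    # most common first, and let the application cope with it.
    return {"agreed": False, "values": [v for v, _ in votes.most_common()]}

# The third replica missed the write of "X" because it was partitioned
# away at write time, so it still holds "Y".
replicas = [{"k": "X"}, {"k": "X"}, {"k": "Y"}]
result = read_from_replicas(replicas, "k")
```

<p>With the replicas above, the caller gets back both candidate values and a flag saying they disagree, which is exactly the situation that is <i>not</i> &#8220;C&#8221;.</p>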
<p>The &#8220;NoSQL&#8221; systems are ACID, as long as you accept that a transaction can only perform one operation, in the sense that the only thing that gets in the way of being ACID is when there are network partitions and the system is called upon to perform operations while the partition is still there.</p>
<p>&#8220;Partition&#8221; is a somewhat slippery concept that I will examine in an upcoming separate essay.  But the basic idea is that it means that there are at least two nodes that cannot send messages to each other.  It&#8217;s important to know that if a node in your system is down, that&#8217;s considered a partition: it&#8217;s as if this node were disconnected from the network.</p>
<p>This also shows that the name &#8220;NoSQL&#8221; doesn&#8217;t explain everything that&#8217;s important about these systems.  But you can&#8217;t pack a whole lot into a short, punchy name, so I&#8217;m not really complaining.  (I do the same thing with the names of my blog essays; <i>mea culpa</i>.)  You just have to keep in mind that the lack of SQL is not the only important thing.</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/nosql-storage-systems-never-violate-acid-never-well-hardly-ever/feed</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>What Does the Proof of the &#8220;CAP theorem&#8221; Mean?</title>
		<link>http://danweinreb.org/blog/what-does-the-proof-of-the-cap-theorem-mean</link>
		<comments>http://danweinreb.org/blog/what-does-the-proof-of-the-cap-theorem-mean#comments</comments>
		<pubDate>Mon, 12 Jul 2010 13:23:15 +0000</pubDate>
		<dc:creator>Dan Weinreb</dc:creator>
				<category><![CDATA[Concurrency]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Software Engineering]]></category>

		<guid isPermaLink="false">http://danweinreb.org/blog/?p=357</guid>
		<description><![CDATA[Several years back, Eric Brewer of U.C. Berkeley presented the &#8220;CAP conjecture&#8221;, which he explained in these slides from his keynote speech at the PODC conference in 2000. The conjecture says that a system cannot be consistent, available, and partition-tolerant; that is, it can have two of these properties, but not all three. [...]]]></description>
			<content:encoded><![CDATA[
		<div style="clear:both;"></div><p>Several years back, Eric Brewer of U.C. Berkeley presented the &#8220;CAP conjecture&#8221;, which he explained in <a href="http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf">these slides from his keynote speech at the PODC conference in 2000</a>.  The conjecture says that a system cannot be consistent, available, and partition-tolerant; that is, it can have two of these properties, but not all three.  This idea has been very influential.</p>
<p>Seth Gilbert and Nancy Lynch, of MIT, in 2002, wrote a now-famous <a href="http://people.csail.mit.edu/sethg/pubs/BrewersConjecture-SigAct.pdf">paper called &#8220;Brewer&#8217;s Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services&#8221;</a>.  It is widely said that this paper <em>proves</em> the conjecture, which is now considered a theorem.  Gilbert and Lynch clearly proved <em>something</em>, but what does the proof mean by &#8220;consistency&#8221;, &#8220;availability&#8221;, and &#8220;partition-tolerance&#8221;?</p>
<p>Many people refer to the proof, but not all of them have actually read the paper, thinking that it&#8217;s all obvious.  I wasn&#8217;t so sure, and wanted to get to the bottom of it.  There&#8217;s something about my personality that drives me to look at things all the way down to the details before I feel I understand. (This is not always a good thing: I sometimes lose track of what I originally intended to do, as I &#8220;dive down a rat-hole&#8221;, wasting time.)  For at least a year, I have wanted to really figure this out.</p>
<p>A week ago, I came across <a href="http://pl.atyp.us/wordpress/?p=2521/"> a blog entry called &#8220;Availability and Partition Tolerance&#8221;</a> by <a href="http://pl.atyp.us/wordpress/?page_id=1278">Jeff Darcy</a>. You can&#8217;t imagine how happy I was to find someone who agreed that there is confusion about the terms, and that they need to be clarified. Reading Jeff&#8217;s post inspired me to finally read Gilbert and Lynch&#8217;s paper carefully and write these comments.</p>
<p>I had an extensive email conversation with Jeff, without whose help I could not have written this.  I am very grateful for his generous assistance.  I also thank Seth Gilbert for helping to clarify his paper for me.  I am solely responsible for all mistakes.</p>
<p>I will now <a href="http://en.wikipedia.org/wiki/Sister_Mary_Ignatius_Explains_It_All_For_You"> explain it all for you</a>.  First I&#8217;ll lay out the basic concepts and terminology.  Then I&#8217;ll discuss what &#8220;C&#8221;, &#8220;A&#8221;, and &#8220;P&#8221; mean, and the &#8220;CAP theorem&#8221;.  Next I&#8217;ll discuss &#8220;weak consistency&#8221;, and summarize the meaning of the proof for practical purposes.</p>
<h4>Basic Concepts</h4>
<p>The paper has terminology and axioms that must be laid out before the proof can be presented.</p>
<p>A distributed system is built of &#8220;nodes&#8221; (computers), which can (attempt to) send messages to each other over a network.  But the network is not entirely reliable.  There is no bound on how long a message might take to arrive.  This implies that a message might &#8220;get lost&#8221;, which is effectively the same as taking an extremely long time to arrive.  If a node sends a message (and does not see an acknowledgment), it has no way to know whether the message was received and processed or not, because either the request or the response might have been lost.</p>
<p>There are &#8220;objects&#8221;, which are abstract resources that reside on nodes.  Objects can perform &#8220;operations&#8221; on other objects.  Operations are synchronous: some thread issues a request and expects a response.  Operations do not request other operations, so they do not do any messaging themselves.</p>
<p>There can be replicas of an object on more than one node, but for the most part that doesn&#8217;t affect the following discussion.  An operation could &#8220;read X and return the value&#8221;, &#8220;write X&#8221;, &#8220;add X to the beginning of a queue&#8221;, etc.  I&#8217;ll just say &#8220;read&#8221; for an operation that has no side-effects and returns some part of the state of the object, and &#8220;write&#8221; to mean an operation that performs side-effects.</p>
<p>A &#8220;client&#8221; is a thread running on some node, which can &#8220;request&#8221; an object (on any node) to perform an operation. The request is sent in a message, and the sender expects a response message, which might return a value, and which confirms that the operation was performed.  In general, more than one thread could be performing operations on one object.  That is, there can be <em>concurrent</em> requests.</p>
<p>The paper says: &#8220;In this note we will not consider stopping failures, though in some cases a stopping failure can be modeled as a node existing in its own unique component of a partition.&#8221;  Of course in any real distributed system, nodes can crash.  But for purposes of this paper, a crash is considered to be a network failure, because from the point of view of another node, there&#8217;s no way to distinguish between the two.  A crashed node behaves exactly like a node that&#8217;s off the network.</p>
<p>You might say that if a node goes off the network and comes back, that&#8217;s not the same as a crash because the node loses its volatile state. However, this paper does not concern itself with a distinction between volatile and durable memory.  There&#8217;s no problem with that; issues of what is &#8220;in RAM&#8221; versus &#8220;on disk&#8221; are orthogonal to what this paper is about.</p>
<h4>Consistent</h4>
<p>The paper says that consistency &#8220;is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.&#8221; They explain this more explicitly by saying that consistency is equivalent to requiring all operations (in the whole distributed system) to be &#8220;linearizable&#8221;.</p>
<p>&#8220;Linearizability&#8221; is a formal criterion presented in the paper &#8220;Linearizability: A Correctness Condition for Concurrent Objects&#8221;, by Maurice Herlihy and Jeannette Wing.  It means (basically) that operations behave <em>as if</em> there were no concurrency.</p>
<p>The linearizability concept is based on a model in which there is a set of threads, each of which can send an operation to an object, and later receive a response.  Despite the fact that the operations from the different threads can overlap in time in various ways, the responses are <em>as if</em> each operation took place instantaneously, in some order.  The order must be consistent with each thread&#8217;s own order, so that a read operation in a thread always sees the results of that thread&#8217;s own writes.</p>
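<p>For a tiny history, you can check this &#8220;as if instantaneous, in some order&#8221; condition by brute force.  The sketch below is my own illustration, not code from Herlihy and Wing or from Gilbert and Lynch.  It assumes a single read/write register, and the names <code>Op</code> and <code>is_linearizable</code> are hypothetical.  It tries every ordering, so it is only usable on a handful of operations.</p>

```python
from itertools import permutations
from typing import NamedTuple

class Op(NamedTuple):
    kind: str      # "write" or "read"
    value: int     # value written, or the value the read returned
    start: float   # time the request was issued
    end: float     # time the response arrived

def respects_real_time(order):
    # If one operation finished before another started, it must come first.
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True

def is_legal_register(order, initial=0):
    # Replay the operations one at a time on a single register.
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True

def is_linearizable(history, initial=0):
    # Brute force: look for any ordering that keeps real-time order
    # and makes every read return the latest written value.
    return any(
        respects_real_time(order) and is_legal_register(order, initial)
        for order in permutations(history)
    )

# The read overlaps the write, so it may see the new value.
overlapping = [Op("write", 1, 0.0, 10.0), Op("read", 1, 2.0, 3.0)]
# The write finished before the read began, yet the read saw the old value.
stale = [Op("write", 1, 0.0, 1.0), Op("read", 0, 2.0, 3.0)]
```

<p>Here <code>is_linearizable(overlapping)</code> is true and <code>is_linearizable(stale)</code> is false.  A thread&#8217;s own operations never overlap each other, so the real-time check also enforces each thread&#8217;s own order.</p>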
<p>Linearizability <em>per se</em> does not include failure atomicity, which is the &#8220;A&#8221; (&#8220;atomic&#8221;) in &#8220;ACID&#8221;.  But Gilbert and Lynch assume no node failures. So operations are atomic: they always run to completion, even if their response messages get lost.</p>
<p>So by &#8220;consistent&#8221; (&#8220;C&#8221;), the paper means that every object is linearizable.  (That&#8217;s not what the &#8220;C&#8221; in &#8220;ACID&#8221; means, by the way, but that&#8217;s not important.)  Very loosely, &#8220;consistent&#8221; means that if you get a response, it has the right answer, despite concurrency.</p>
<p>This is <em>not</em> what the &#8220;C&#8221; in &#8220;ACID transaction&#8221; means.  It&#8217;s what the &#8220;I&#8221; means, namely &#8220;isolation&#8221; from concurrent operations.  This is probably a source of confusion sometimes.</p>
<p>Furthermore, the paper says nothing about transactions, which would have a beginning, a <em>sequence</em> of operations, and an end, which may commit or abort.  &#8220;ACID&#8221; is talking about the entire transaction.  The &#8220;linearizability&#8221; criterion only talks about individual operations on objects.  (So the whole &#8220;ACID versus BASE&#8221; business, while cute, can be misleading.)</p>
<h4>Available</h4>
<p>&#8220;Available&#8221; is defined as &#8220;every request received by a non-failing node in the system must result in a response.&#8221;  The phrase &#8220;non-failing node&#8221; seemed to imply that some nodes might be failing and others not.  But since the paper postulates that nodes never fail, I believe the phrase is redundant, and can be ignored. After the definition, the paper says &#8220;That is, any algorithm used by the service must eventually terminate.&#8221;</p>
<p>The problem here is that &#8220;eventually&#8221; could mean a trillion years. This definition of &#8220;available&#8221; is only useful if it includes some kind of real-time limit: the response must arrive within a period of time, which I&#8217;ll call the maximum latency, L.</p>
<p>Next, it&#8217;s very important to notice that &#8220;A&#8221; says nothing about the content of the response.  It could be anything, as far as &#8220;A&#8221; is concerned; it need not be &#8220;successful&#8221; or &#8220;correct&#8221;.  (If you think otherwise, see section 3.2.3.)</p>
<p>So &#8220;available&#8221; (&#8220;A&#8221;) means: If a client sends a request to a node, it always gets back some response within time L, but there is no guarantee about the contents of the response.</p>
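<p>Here is a minimal sketch of what &#8220;some response within the maximum latency&#8221; looks like from the caller&#8217;s side.  It uses Python&#8217;s standard <code>concurrent.futures</code> module.  The function name <code>request_with_deadline</code> and the shape of the response dictionary are assumptions of mine, not anything from the paper.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def request_with_deadline(operation, max_latency):
    """Return some response within roughly `max_latency` seconds.

    "A" only promises that a response arrives; it does not promise that
    the response is successful or correct.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(operation)
    try:
        return {"status": "ok", "value": future.result(timeout=max_latency)}
    except TimeoutError:
        return {"status": "timed out", "value": None}
    except Exception as exc:
        return {"status": "error", "value": repr(exc)}
    finally:
        # Do not wait for a straggler; that would break the deadline.
        pool.shutdown(wait=False)

fast = request_with_deadline(lambda: 42, 0.5)
slow = request_with_deadline(lambda: time.sleep(0.3), 0.05)
```

<p>Both calls come back with <em>a</em> response.  Only the first one carries a useful value, which is all that &#8220;A&#8221; by itself buys you.</p>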
<h4>Partition Tolerant</h4>
<p>There is no definition, <em>per se</em>, of the term &#8220;partition-tolerant&#8221;, not even in section 2.3, &#8220;Partition Tolerance&#8221;.</p>
<p>First, what is a &#8220;partition&#8221;? They first define it to mean that there is a way to assort all the nodes into separate sets, which they call &#8220;components&#8221;, and all messages sent from a node in one component to a node in a separate component are lost. But then they go on to say &#8220;And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instance the message is lost.&#8221;  For their formal purposes, &#8220;partition&#8221; simply means that a message can be lost. (The whole &#8220;component&#8221; business can be disregarded.)  That&#8217;s probably <em>not</em> what you had in mind!</p>
<p>In real life, some messages are lost and some aren&#8217;t, and it&#8217;s not exactly clear when a &#8220;partition&#8221; situation starts, is happening, or ends.  I realize that for practical purposes, we usually know what a partition means, but if we&#8217;re going to do formal proofs and understand what was proved, one must be completely clear about these terms.</p>
<p>Even in a local-area network, packets can be dropped.  Protocols like TCP re-transmit packets until the destination acknowledges that they have arrived.  If that happens, it&#8217;s clearly not a network failure from the point of view of the application.  &#8220;Losing messages&#8221; must have something to do with nodes entirely unable to communicate for a &#8220;long&#8221; time compared to the latency requirements of the system.</p>
<p>Furthermore, remember that node failure is treated as a network failure.</p>
<p>So &#8220;partition-tolerant&#8221; (&#8220;P&#8221;) means that any guarantee of consistency or availability is still guaranteed even if there is a partition.  In other words, if a system is <em>not</em> partition-tolerant, that means that <em>if</em> the network can lose messages or any nodes can fail, <em>then</em> any guarantee of consistency or availability is voided.</p>
<h4>CAP</h4>
<p>The CAP theorem says that a distributed system as described above cannot have properties C, A, and P all at the same time.  You can only have two of them.  There are three cases:</p>
<p>AP: You are guaranteed to get back responses promptly (even with network partitions), but you aren&#8217;t guaranteed anything about the value/contents of the response.  (See section 3.2.3.) A system like this is entirely useless, since any answer can be wrong.</p>
<p>CP: You are guaranteed that any response you get (even with network partitions) has a consistent (linearizable) result.  But you might not get any responses whatsoever.  (See section 3.2.1.)  This guarantee is also completely useless, since the entire system might always behave as if it were totally down.</p>
<p>CA: If the network never fails (and nodes never crash, as they postulated earlier), then, unsurprisingly, life is good.  But if messages could be dropped, <em>all</em> guarantees are off.  So a CA guarantee is only useful in a totally reliable system.</p>
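<p>The difference between the AP and CP cases can be shown with a toy simulation.  This is only a sketch under my own assumptions: the <code>Replica</code> class and its <code>mode</code> flag are hypothetical, and the AP node here returns a stale value, which is just one of the arbitrary answers the formal definition would allow.</p>

```python
class Replica:
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (labels from the text)
        self.value = "old"
        self.partitioned = False    # True when it cannot reach its peers

    def read(self):
        if not self.partitioned:
            return self.value       # peers reachable: answer is up to date
        if self.mode == "AP":
            return self.value       # answers promptly, but may be stale
        # CP: refuse rather than risk an inconsistent answer.
        raise TimeoutError("no response while partitioned")

def simulate(mode):
    node = Replica(mode)
    node.partitioned = True
    # Imagine a write of "new" landing on the other side of the
    # partition; this node never sees it.
    try:
        return node.read()
    except TimeoutError:
        return None                 # the client got no answer at all

ap_answer = simulate("AP")
cp_answer = simulate("CP")
```

<p>The AP node answers with the out-of-date value, and the CP node gives no answer at all.  Neither outcome is much use on its own, which is the point of the three cases above.</p>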
<p>At first, this seems to mean that practical, large distributed systems (which aren&#8217;t entirely reliable) can&#8217;t make <em>any</em> useful guarantees!  What&#8217;s going on here?</p>
<h4>Weak Consistency</h4>
<p>Large-scale distributed systems that must be highly available can provide some kind of &#8220;weaker&#8221; consistency guarantee than linearizability.  Most such systems provide what they call &#8220;eventual consistency&#8221; and may return &#8220;stale data&#8221;.</p>
<p>For some applications, that&#8217;s OK.  Google search is an obvious case: the search is already specified/known to be using &#8220;stale&#8221; data (data since the last time Google looked at the web page), so as long as partitions are fixed quickly relative to the speed of Google&#8217;s updating everything (and even if sometimes not, for that matter), nobody is going to complain.</p>
<p>Just saying that results &#8220;might be stale&#8221; and will be &#8220;eventually consistent&#8221; is unfortunately vague.  How stale can it be, and how long is &#8220;eventually&#8221;? If there&#8217;s no limit, then there&#8217;s no useful guarantee.</p>
<p>For a staleness-type weak consistency guarantee, you&#8217;d like to be able to say something like: &#8220;operations (that read) will always return a result that was consistent with all the other operations (that write) no longer ago than time X&#8221;.  And this implies that &#8220;write&#8221; operations are never lost, i.e. always happen within a fixed time bound.</p>
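<p>One way to picture such a guarantee is a replica that refuses to answer once its data is older than the bound X.  The sketch below is illustrative only: the class name <code>BoundedStalenessReplica</code> and its methods are made up, and time is passed in explicitly so the example is deterministic.</p>

```python
class BoundedStalenessReplica:
    def __init__(self, max_staleness):
        self.max_staleness = max_staleness  # the bound "X", in seconds
        self.value = None
        self.synced_at = None               # when we last caught up on writes

    def apply_sync(self, value, now):
        # Called when this replica has received all writes up to time `now`.
        self.value = value
        self.synced_at = now

    def read(self, now):
        # Only answer if the data is no older than the staleness bound.
        if self.synced_at is None or now - self.synced_at > self.max_staleness:
            raise RuntimeError("data older than the staleness bound")
        return self.value

replica = BoundedStalenessReplica(max_staleness=5.0)
replica.apply_sync("v1", now=100.0)
fresh_enough = replica.read(now=103.0)
```

<p>A read at time 103 succeeds, but a read at time 106 would raise an error.  Note that refusing to answer gives up availability in order to keep the staleness bound, so a real system has to decide which of the two it would rather break.</p>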
<p><strong>t-Connected Consistency</strong></p>
<p>Gilbert and Lynch discuss &#8220;weakened consistency&#8221; in section 4. It&#8217;s also about stale data, but with &#8220;formal requirements on the quality of stale data returned&#8221;.  They call it &#8220;t-Connected Consistency&#8221;.</p>
<p>It makes three assumptions.  (a) Every node has a clock that can be used to do timeouts. The clocks don&#8217;t have to be synchronous with each other.  (b) There&#8217;s some time period after which you can assume that an unanswered message must be lost. (c) Every node processes a received message within a given, known time.</p>
<p>The real definition of &#8220;t-Connected Consistency&#8221; is too formal for me to explain here (see section 4.4).  It (basically) guarantees (1) when there is no partition, the system is fully consistent; (2) if a partition happens, requests can see stale data; and (3) after the partition is fixed, there&#8217;s a time limit on how long it takes for consistency to return.</p>
<p>Are the assumptions OK in practice?  Every real computer can do timeouts, so (a) is no problem.  You can always ignore any responses to messages after the time period, so (b) is OK.  It&#8217;s not obvious that every system will obey (c), but some will.</p>
<p>I have two reservations.  First, if the network is so big that it&#8217;s never entirely working at any one time, what would guarantee (3) mean?  Second, in the algorithm in section 4.4, in the second step (&#8220;<em>write</em> at node A&#8221;), it retries as long as necessary to get a response.  But that could exceed L, violating the availability guarantee.</p>
<p>So it&#8217;s not clear how attractive t-Connected Consistency really is.  It can be hard to come up with formal proofs of more complicated, weakened consistency guarantees.  Most working software engineers don&#8217;t think much about formal proofs, but don&#8217;t underrate them.  Sometimes they can help you identify bugs that would otherwise be hard to track down, before they happen.</p>
<p>Jeff Darcy wrote <a href="http://pl.atyp.us/wordpress/?p=2532">a blog posting about &#8220;eventual consistency&#8221;</a> about a half year ago, which I recommend.  And there are other kinds of weak consistency guarantees, such as the one provided by <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf">Amazon&#8217;s Dynamo key-store</a>, which is worth examining.</p>
<h4>Reliable Networks</h4>
<p>Can&#8217;t you just make the network reliable, so that messages are never lost?  (&#8220;Never&#8221; meaning that the probability of losing a message is as low as that of any other failure mode that you&#8217;re not protecting against.)</p>
<p>Lots and lots of experience has shown that in a network with lots of routers and such, no matter how much redundancy you add, you <em>will</em> experience lost messages, and you <em>will</em> see partitions that last for a significant amount of time.  I don&#8217;t have a citation to prove this, but, ask around and that&#8217;s what experienced operators of distributed systems will always tell you.</p>
<p>How many routers is &#8220;lots&#8221;?  How reliable is it if you have no routers (layer 3 switches), only hubs (layer 2 switches)?  What if you don&#8217;t even have hubs?  I don&#8217;t have answers to all this.  But if you&#8217;re going to build a distributed system that depends on a reliable network, you had better ask experienced people about these questions.  If it involves thousands of nodes and/or is geographically distributed, you can be sure that the network will have failures.</p>
<p>And again, as far as the proof of the CAP theorem is concerned, node failure is treated as a network failure.  Having a perfect network does you no good if machines can crash, so you&#8217;d also need each node to be highly-available in and of itself.  That would cost a lot more than using &#8220;commercial off-the-shelf&#8221; computers.</p>
<h4>The Bottom Line</h4>
<p>My conclusion is that the proof of the CAP theorem means, in practice: <em>if</em> you want to build a distributed system that (1) is large enough that nodes can fail and the network can&#8217;t be guaranteed never to lose messages, <em>and</em> (2) returns a useful response to every request within a specified maximum latency, <em>then</em> the best you can guarantee about the meaning of the response is some kind of &#8220;weak consistency&#8221;, which you had better carefully <em>define</em> in such a way that it&#8217;s useful.</p>
<h4>P.S.</h4>
<p>After writing this but just before posting it, Alex Feinberg added a comment to my previous blog post with a link to <a href="http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/">this excellent post by Henry Robinson</a>, which discusses many of the same issues and links to even more posts.  If you want to read more about all this, take a look.</p>
]]></content:encoded>
			<wfw:commentRss>http://danweinreb.org/blog/what-does-the-proof-of-the-cap-theorem-mean/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

