Participate in SPLASH 2011!

March 11th, 2011
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

The SPLASH Conference is my favorite technical conference every year. Wait, don’t stop reading just because you’re not a “researcher”!

I started this blog so that I could post a trip report from the conference in 2007. The name has since changed from “OOPSLA” to “SPLASH”. Actually it now has three tracks. Here’s what each is about:

  • OOPSLA: High quality research work that uses established scientific methodologies, written using high standards of academic technical publications.
  • Wavefront: Original and innovative architecture, design, and/or implementation techniques used in actual leading-edge software system. Our goal with Wavefront is to engage the software developers who are actually creating next generation of software systems and to make sure that their innovations are captured in the technical archives of computing.
  • Onward!: Innovative ideas that challenge existing beliefs, or early work well written and well argued for; essays and ideas worth hearing about. (This includes things too “far out” to get published in existing journals and other conferences.

If I am judging my own audience properly, Wavefront is what most of you would be interested in. Here is a great blog post by Allan Wirfs-Brock, who has been a leader in object-oriented and dynamic programming for decades. The Wavefront track is to revive what made the early OOPSLA conferences so energetic and valuable: interaction between researchers and people out there getting stuff done. Both groups have a lot to tell, and learn from, the other.

I am the “Panels Chair”, and I’m actively seeking people who would like to lead a panel, be on a panel, and/or suggest topics for panels. Panels are fun to be on. It’s far less work than submitting a paper. All you have to do is come up with five minutes of something to say, after which it’s all questions and answers. Let me know! Tell your friends!

The overall theme of the conference is The Internet as the world-wide Virtual Machine. This theme captures the change in the order of magnitude of computing that happened over the past few years. These days, software systems are rarely designed in isolation; they connect to pieces written by 3rd parties, they communicate with other pieces over the Internet, they use big data produced elsewhere, and they touch millions of interacting users through an ever larger variety of physical devices. In other words, the “machine” is now a global computing network. What does this entail for software development itself?

SPLASH 2011 will be in Portland, OR, October 22 – 27; the main tracks will probably be from Tuesday Oct 25 to Thursday Oct 27. The information is all here, including the calls for papers for each track. If you want to submit a paper, the deadline is April 8, 2001.

Just to get you started, and to show how broad the scope of the conference is, here are some possible areas. You could come up with something that impinges on one of these, but that’s not necessary. Panels can be in any of the tracks of SPLASH (OOPSLA, Wavefront, or Onward!).

  • Any aspect of software development, including prototyping, design, testing, evaluation, maintenance, reuse, static or dynamic analysis, frameworks and toolkits.
  • Language design issues, such as dynamic or static programming, type systems and type inference, use of modularity and parallelism, patterns. Dynamic languages are welcome. JavaScript, in particular, has become important lately, but any language is fine.
  • Language implementation issues: virtual machines, garbage collectors, compilers/interpreters, power efficiency.
  • Tools designed to reduce the time, effort, and/or cost of software systems.
  • And any of a wide range of topics: cloud computing and web platforms, mobile platforms, security and privacy issues, UI technology, location-awareness, storage, reliability.

I hope to see you there!

The Egyptian Revolution Was Organized

February 10th, 2011
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

[Note added much later, 5/31/2011: I recently learned more about what actually happened in Egypt, and now I don't believe what I say in this post. In fact, Facebook played a very major role in organizing the demonstrations.  It turns out that you can propose a demonstration using Facebook and get a lot of people to come to it, and that can have a huge role in triggering a revolt, even if the people know that there might be some danger.]

At least in the beginning, the current revolution in Egypt was guided by a small team. It was not entirely a spontaneous uprising. How do these things happen?

A few months ago, Malcolm Gladwell wrote an article in The New Yorker about the use of of Internet social media for social activism. He basically debunks a lot of what has been said about the use of Facebook and Twitter in recent revolutions. You may remember hearing about the great importance of Twitter in the Iran uprisings, e.g. how it was used to get information out of Iran, or allow the protesters to work together. He points out:
The people tweeting about the demonstrations were almost all in the West. “It is time to get Twitter’s role in the events in Iran right,” Golnaz Esfandiari wrote, this past summer, in Foreign Policy. “Simply put: There was no Twitter Revolution inside Iran”” The cadre of prominent bloggers, like Andrew Sullivan, who championed the role of social media in Iran, Esfandiari continued, misunderstood the situation. “Western journalists who couldn’t reach — or didn’t bother reaching — people on the ground in Iran simply scrolled through the English-language tweets post with tag #iranelection,” she wrote. “Through it all, no one seemed to wonder why people trying to coordinate protests in Iran would be writing in any language other than Farsi.”

Gladwell shows how the civil rights movement in the United States in the 1960′s worked, in some detail. The volunteers had very strong personal connections. What they were doing was extremely risky and required a lot of physical courage. To make this work required work like a military campaign, with strong central leadership and careful planning.

Today, the New York Times ran an article about how the present Egyptian uprisings started. It fits the very model that Gladwell described a few months ago: a small central group, starting with careful planning and pre-testing. Anyone looking at what’s been happening in Egypt for the last few weeks could easily think that it was all disorganized and spontaneous. But the organization wasn’t easily appearent; in fact, the organizers kept their activities secret as long as possible, for obvious reasons. It’s only now that we are finding out in more detail what happened.

I see this as a vindication of, or at least a strong data-point in support of, the argument Gladwell put forth in his article. As Gladwell points out, social media such as Facebook, with their weaker forms of more-widespread links and decentralization are excellent for some purposes, just not for these, and we should not jump to conclusions.

By the way, when the protests started, I was not optimistic about success. Success is very difficult. I expected brutal retaliation followed by the regime’s becoming even more autocratic and harsh. I am very surprised and very pleased at what has happened. The people of Egypt have worked very hard and taken considerable risk to make this happen. As I write this, the news is that Mubarak is still refusing to step down, despite expectations earlier today that he would do so. He refused to give any assurances that he would step down in September. It’s hard for me to see why any such “assurances” would carry weight anyway; if the protests stop, he can just carry on business as usual.

Not surprisingly, he is blaming “foreigners” despite the obvious fact that the protesters are ordinary Egyptians. This is very, very different from foreign overthrows such as those of Mohammad Mossaddegh in Iran in 1953, or of Jacobo Arbenz in Guatamala in 1954, which were staged primarily by the C.I.A. I have read quite extensively about those events and there is absolutely no way that the present events in Egypt match that pattern.

I continue to hope for their success.

Improving the PACELC Taxonomy

January 12th, 2011
news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

Daniel Abadi of Yale published a blog article last April criticizing the characterization of distributed storage systems using “you only get two of C, A, and P”. He proposed a new taxonomy with the acronym PACELC, to be read “If P, then trade off A for C; if E, trade off L for C”, with this meaning:

  • When there is a partition, how does the system trade off between:
    • Availability and
    • Consistency
  • Else, when there are no partitions, how does the system trade off between:
    • Latency and
    • Consistency

For example, a system characterized as PA/EL means that in the face of partitions, it favors availability over consistency, and if everything’s working, it favors low latency over consistency.

I think this is moving very much in the right direction, and I hope I can contribute and help develop these ideas a bit.

Problems with the “Proof of the CAP theorem”

The “CAP” characterization has a lot of problems. It is especially poorly applied, if not actually misused, when someone trots out the “proof of the CAP theorem” to show how they were forced into a tradeoff. While the proof is correct, what is proves is too crude to model what we really care about.

I discussed the proof in an earlier post. In the proof, each attribute is problematic:

  1. “Consistency” means the behavior that would have happened on a single server that never crashed and did each operation in serial, which is fine, but lack of consistency means that the system makes no guarantees or representations about the result of an operation whatsoever.
  2. “Available” means that you get an answer “eventually”, but since eventually can mean any amount of time (a trillion years), there’s no practical difference between A and not A.
  3. “Partition-tolerance” is never actually defined in the paper.

“E” implies “A”

The reason the acronym doesn’t need to be “PACELCA” is that if there are no partitions, then the system must be available. Adding an “A” to the second part is redundant. But for me (maybe not for you), putting in the redundant “A” in the “E” case helps me. A PA/EL system is always “available”, and calling it PA/ELA makes it easier for me to see that availability is always there.

How do Availability and Latency relate?

Consider what “highly available” and “low latency” mean. They are not entirely distinct and orthogonal. The only useful meaning of “A” is that the system replies within a maximum latency. It could be something like “response within 10ms at least 90% of the time and within 100ms in any case” rather than a simple deadline. We can call this “fast enough” to meet the system requirements. So availability is about latency.

There is, however, an important practical difference. “Available” refers to a system’s latency related to the amount of time it takes to repair a partition.

To see this, consider two web sites (with human users) that are based on a system that can have partitions:

  • The operators of the system move so quickly that they always fix partitions within 10ms. The system is “available” even in the face of any single partition, without any special mechanism to be “partition tolerant”.
  • The operators of the system move so slowly that it takes them five minutes to fix a partition. If the system has no way to be “partition tolerant”, it’s not available.

Latency (the “L” in PACELC) has nothing to do with repair time, since it only applies when there are no partitions. A web site is far better with a (maximum/average/whatever) latency of 10ms than with 1000ms.

So “A” and “L” are different. But, that said, even if a system meets its “A” (fast enough) requirement, it can be valuable to lower the latency below that requirement. The “PAC” characterization does not take this into account.

PC/EL is confusing

If a system is consistent when there are partitions, then surely it’s also consistent when there aren’t any partitions. If the components work better, the service should not be worse.

At first glance, this seems to mean that “if PC, then EC”. That would mean that PC/EL can’t describe any realistic system, but Prof. Abadi characterizes PNUTS/Sherpa (as originally presented). I’m sure that there isn’t really a paradoxical situation with any real system, but rather that there is a way to misinterpret the PACELC notation. What do PC and EL really mean?

PC means that if a client sends a request when there are partitions that prevent the system from answering promptly and correctly, then the system does not answer, rather than providing an answer that might be incorrect. Indeed, it might not be able to reply at all, since a total failure is a kind of partition, and there just isn’t anybody to send back a reply.

EL means that if a client sends a request, and the system can choose between waiting a longer time to send a consistent answer, versus waiting a shorter time to send an inconsistent answer, it chooses (or tilts toward) the latter.

Loose ends

  • What does “C” really mean? Can’t we say something better than “we don’t guarantee consistency”? Dynamo can give you answers that are not definitive but are very useful, with semantics that the application can understand. What about “eventual consistency”?
  • What about durability? There’s a big difference between some data being temporarily offline versus data being lost forever. Some systems use “commits over a WAN” to replace the use of disks, and then the tradeoff of latency versus correctness, from synchronous to asynchronous commits, is important.
  • Should be distinguish between “available for read” vs. “available for write”? This can come up in, e.g., a master-slaves configuration.
  • Stay tuned.

    Errors in Database Systems Still Must Consider Network Partitions

    December 14th, 2010
    news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

    Prof. Michael Stonebraker wrote a paper, published in the April 2010 issue of CACM, entitled “Errors in Database Systems: Eventual Consistency and the CAP Theorem”. As I see it, the overall point of the paper is that the kinds of failures that cause partition-tolerance problems are rare, and not too significant compared to the other ways that a DBMS can fail. Therefore, people are worrying too much about the CAP issue, specifically network partitions.

    The paper enumerates eight causes of DBMS failure as seen by an application, such as software errors in the DBMS itself, operating system errors, and so on. Number six in his list is “a network partition in a local cluster”, and number eight is “a network failure in the WAN connecting clusters together; the WAN failed and clusters can no longer all communicate with each other”. (The usual reason one would have multiple clusters connected by a WAN is for “disaster recovery”, i.e. dealing with a problem that causes an entire cluster to fail, such as power loss over all hardware on which the cluster depends.)

    Number six is the crucial issue in the paper, as far as the CAP issue goes. He has two answers:

    1. In my experience, this is exceedingly rare, especially if one replicates the LAN (as Tandem did).
    2. The overwhelming majority [of local failures] cause a single node to fail, which is a degenerate case of a network partition that is easily survived by lots of algorithms.

    About #1, I have spent some time talking to the operations architects at ITA Software, which has run high-availability servers for many years now, and heard about their experience. It depends on what you mean by a “LAN”. If you mean a few computers connected together by an Ethernet, with redundant hardware all around, then the chance of a failure of the network itself is relatively low. However, real-world data centers with a relatively large number of servers rarely work this way. The problem is that a real network is very complicated. It depends on switches at both level 3 (routers) and level 2 (hubs). Situations can arise in which pieces of the network are mis-configured by accident; these can be hard to find due, ironically, the very redundancy that was added to avoid failures. In particular, there is no way to make any kind of guarantee about the latency within the network, nor the likelihood that a packet will make it from its source to its destination.

    About #2, it would have been helpful if there were citations in the paper. It is hard to reply to such a claim without specifics. One of the techniques to deal with one server being down involved “quorums”, but they can introduce problems with high-availability.

    But, more importantly, consider some of the failure modes that Amazon’s “Dynamo” highly-available key-value store is built to deal with. Suppose we have a key-value pair that resides on two servers, for high availability. Call them A1 and A2. An application changes the value of the pair, but does so at a time with A1 is down or unreachable. So the update is made to replica A2. Later, a second application reads the value associated with the key, but this time server A1 is down or unreachable, and server A2 is available. The second application might not see the new value written by the first application. I don’t know any of “lots of algorithms” that deals with this sort of scenario while providing complete consistency/correctness.

    The conclusion of the paper is: “In summary, one should not throw out the C so quickly, since there are real error scenarios where CAP does not apply and it seems like a bad tradeoff in many of the other situations.” But since network-partition failures really can happen, it’s not clear that one can simply decide not to throw out the consistency/correctness criterion.

    New Ruling from the FCC Affecting Net Neutrality

    December 3rd, 2010
    news and informationbusiness,health,entertainment,technology automotive,business,crime,health,life,politics,science,technology,travel

    Some time ago I published some blog entries about net neutrality, and the actions of the FCC and the carriers. Yesterday there was a new announcement about these topics, which have many Internet analysts and net neutrality advocates quite upset. The issues are, unfortunately, hard to summarize, so I’ll just provide links to some good things to read.

    This article in Wired Magazine is a good place to start.

    The negative comment from Marvin Ammori, quoted in the Wired story, comes from this story published (perhaps in other places as well) in the Huffington Post, with a lot of analysis.

    There is more analysis here, from Josh Silver of Free Press:

    and more from him, the day before the announcement:

    This article shows how happy the carriers are with the announced plan.

    For some historical perspective, see this story from the Washington Post.

    This is another attempt to explain what is all means, from CNBC.

    This is an article published in Atlantic Monthly opposing the proposal.

    What I don’t quite get is why prominent Silicon Valley investors are happy about this. This one quotes Ron Conway, the most famous of the Silicon Valley “super-angel” investors: John Doerr, perhaps the most famous VC anywhere, is also in favor, although I could only find that short quote.

    I have been told that Google and other internet companies are running full page advertisements asking Obama to live up to his commitment to net neutrality.