Friday, November 19, 2010

Iconic distributed systems research comes back around

Distributed systems research has been going on since the very first time someone decided to network multiple computers. Industry and academia have shared this burden, and we're here today because of many different people and organisations. Some of this work is frequently referenced and built upon, such as Lamport's paper "Time, Clocks, and the Ordering of Events in a Distributed System", or RFC 707, which concerns RPCs. But some of it, such as Stuart's work on Coloured Actions, or much of the work on weak consistency replication, receives far less attention.

However, sometimes it's simply a matter of timing, with some research happening before it's really needed or truly appreciated. A case in point is much of the work produced during the mid-1990s on configurable distributed systems, particularly that presented at and documented by the IEEE Conference on Configurable Distributed Systems (other workshops and institutions were doing similar work, but this conference is the one I know best, since I had several papers published there over the years). Much of this work concerned autonomous systems that reacted to change, e.g., the failure of machines or networks, or an increased workload on a given machine that prevented it from meeting its performance targets. Some of these systems could then dynamically adapt to the changes and reconfigure themselves, e.g., by spinning up new instances of services elsewhere to shift the load, or by routing messages to alternative machines, or via alternative routes to the original destination, thereby bypassing network partitions.

This is a gross simplification of the many and varied techniques that were discussed and developed almost two decades ago to provide systems that required very little manual intervention, i.e., they were almost entirely autonomous (in theory, if not always in practice). With the growing popularity of all things Cloud-related, these techniques and ideas are once again extremely important. If Cloud (whether public or private) is to be differentiated from, say, simply virtualizing the infrastructure in IT departments, then autonomous monitoring, management and reconfiguration are critical: they ensure that developers have on-demand access to the compute resources they need, and that those resources perform according to requirements. In most cases this needs to happen dynamically and be driven by the system itself, because there should be little or no involvement from your friendly neighbourhood system administrator (in fact, in some cases such an individual may not exist!).

I'm hoping that people and organisations today won't overlook the relevant R&D that happened in the 1990s just because Cloud didn't exist as an identifiable concept back then. Reworking and retasking some of this prior work could save us a lot of time and effort, even if it's just to convince today's engineers that certain paths or possible solutions aren't viable. For a start, I know I'll be getting my copies of those proceedings out again to refresh my memory!
