Monday, August 29, 2011

Enterprise middleware and PaaS

I wanted to say more about why existing enterprise middleware stacks can be (should be) the basis for realistic PaaS implementations. If I get time, I may write a paper and submit it to a journal or conference, but until then this will have to do. I'm talking about this at JavaOne this year too, so a presentation may well come out soon.

Sunday, August 21, 2011

Fault tolerance

There was a time when people in our industry were very careful about using terms such as fault tolerance, transactions and high availability, to name just three. Back before the Internet really kicked off (really when the web came along), if you were emailing someone then they tended to be either in academia, in which case they'd be summarily shot for misusing a term, or in the DoD, in which case they'd probably be shot too! If you were publishing papers or your thoughts for wider review, you tended to have to wait a year to see publication, and that was only if the reviewers didn't shoot you down for misusing terms, in which case you had to start all over again. So it paid to think long and hard before you did the equivalent of hitting submit.

Today we live in a world of instant publishing and less and less peer review. It's also unfortunate that despite the fact that more and more papers, articles and journals are online, it seems that fewer and fewer people are spending the time to research things and read up on the state of the art, even if that art was produced decades earlier. I'm not sure if this is because people simply don't have the time, simply don't care, don't understand what others have written, or something else entirely.

You might ask what it is that has prompted me to write this entry. Well, on this particular occasion it's people using the term 'fault tolerance' in places where it may be accurate when considering the meaning of the words in the English language, but not when looking at the scientific meaning, which is often very different. For instance, let's look at one scientific definition of the term (software) 'fault tolerance'.

"Fault tolerance is intended to preserve the delivery of correct service in the presence of active faults. It is generally implemented by error detection and subsequent system recovery.
Error detection originates an error signal or message within the system. An error that is present but not detected is a latent error. There exist two classes of error detection techniques: (a) concurrent error detection, which takes place during service delivery; and (b) preemptive error detection, which takes place while service delivery is suspended; it checks the system for latent errors and dormant faults.
Recovery transforms a system state that contains one or more errors and (possibly) faults into a state without detected errors and faults that can be activated again. Recovery consists of error handling and fault handling. Error handling eliminates errors from the system state. It may take two forms: (a) rollback, where the state transformation consists of returning the system back to a saved state that existed prior to error detection; that saved state is a checkpoint, (b) rollforward, where the state without detected errors is a new state."

There's a lot in this relatively simple definition. For a start, it's clear that recovery is an inherent part, and that includes error handling as well as fault handling, neither of which is trivial to accomplish, especially when you are dealing with state. Even error detection can seem easy if you don't understand the concepts. Over the past 4+ decades all of this and more has driven the development of the protocols behind transaction processing, failure suspectors, strong and weak replication protocols, etc.
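Just to make the rollback form of error handling concrete, here's a deliberately toy sketch (my own illustration, with invented names, not the way any particular transaction or middleware product actually does it): take a checkpoint of the state before doing the work, and if an error is detected, return the state to that checkpoint.

```java
// Toy illustration only: checkpoint-based rollback as a form of error handling.
// The class and method names are invented for this example.
import java.util.HashMap;
import java.util.Map;

public class KeyValueService {
    private Map<String, String> state = new HashMap<>();
    private Map<String, String> checkpoint;

    // Save the state we would roll back to if an error is detected.
    private void takeCheckpoint() {
        checkpoint = new HashMap<>(state);
    }

    // Rollback: error handling by returning to the saved (checkpointed) state.
    private void rollback() {
        state = new HashMap<>(checkpoint);
    }

    // Concurrent error detection: check an invariant while delivering the
    // service, and recover by rolling back if it is violated.
    public void update(String key, String value) {
        takeCheckpoint();
        state.put(key, value);
        if (key == null || value == null || value.isEmpty()) {
            rollback();
            throw new IllegalArgumentException("invalid update rolled back");
        }
    }
}
```

Of course, even in something this trivial the hard questions are elsewhere: where the checkpoint lives (it's no good if it vanishes with the process), and the fault handling that decides whether the component should be trusted again at all. That's exactly where transaction logs and replication earn their keep.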

So it's both annoying and frustrating to see people talking about fault tolerance as if it's as easy to accomplish as, say, throwing a few extra servers at the problem or restarting a process if it fails. Annoying in that there are sufficient freely available texts out there to cover all of the details. Frustrating in that the users of implementations based on these assumptions are not aware of the problems that will occur when failures happen. As with those situations I've come across over the years where people don't believe they need transactions, the fact that failures are not frequent tends to lull you into a false sense of security!
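And "restarting a process if it fails" quietly assumes the hard part: knowing that it has failed at all. The usual building block here is a failure suspector; the sketch below (again my own toy version, not any particular group communication stack) shows the classic timeout-on-heartbeats approach, and also why it can only ever suspect a failure rather than know about one.

```java
// Toy timeout-based failure suspector: suspect a node if we have not seen a
// heartbeat from it within the timeout. Purely illustrative.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FailureSuspector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public FailureSuspector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a heartbeat has just arrived from the given node.
    public void heartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    // We can only ever *suspect* failure: in an asynchronous system a node
    // that is merely slow (or stuck behind a congested network) looks
    // identical to one that has crashed.
    public boolean isSuspected(String nodeId) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || (System.currentTimeMillis() - last) > timeoutMillis;
    }
}
```

Get the timeout wrong in one direction and you suspect perfectly healthy nodes and trigger needless recovery; get it wrong in the other and you sit on errors for a long time. That trade-off is precisely the sort of thing the decades of literature in this area are about.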

Now before anyone suggests that this is me being a Luddite, I should point out that I'm a scientist and I recognise fully that theories and practices in many areas of science, e.g., physics, are developed based on observations and can change when they prove not to be sufficient to describe the things you see. So, for instance, unlike those who in Galileo's time continued to believe the Earth was the centre of the Universe despite a lot of data to the contrary, I accept that theories, rules and laws laid down decades ago may have to be changed today. The problem I have in this case, though, is that nothing I have seen or heard in the area of 'fault tolerance' gives me any indication that this is the situation currently!

Tuesday, August 09, 2011

A thinking engineer

I've worked with some great engineers in my time (and continue to work with many today), and as an aside, I like to think some people might count me in their list. But back to the topic: over the years I've also met some people who would be considered great engineers by others, but whom I wouldn't rate that highly. The reason for this is also one of the factors I always cite when asked what constitutes a great engineer. Of course I rate the usual things, such as the ability to code, understand algorithms, and know a spin-lock from a semaphore. Now maybe it's my background (I really think not, but I threw that out there just in case I'm wrong), but I also add the ability to say no, or to ask why or what if? To me, it doesn't matter whether you're an engineer or an engineering manager, you've got to be confident enough to question the things you are asked to do, unless of course you know them to be right from the start.

As a researcher, you're expected to question the work of others who may have been in the field for decades, published dozens of papers and are recognised experts in their fields. You don't take anything at face value. And I believe that is a quality really good engineers need to have too. You can be a kick-ass developer, producing the most efficient bubble-sort implementation available, but if it's a solution to the wrong problem then it's really no good to me! I call this The Emperor's New Clothes syndrome: if he's naked then say so; don't just go with the flow because your peers do.

Now as I said, I've had the pleasure to work with many great engineers (and engineering managers) over the years, and this quality, let's call it "thinking and challenging" is common to them all. It's also something I try to foster in the teams that work for me directly or indirectly. And although I've implicitly been talking about software engineering, I suspect the same is true in other disciplines.

True Grit update

A while ago I mentioned that I was reading the novel True Grit and was a fan of the original film, which I watched when I was a child. I also mentioned that I probably wouldn't be watching the remake of the film as I couldn't see how the original could be improved. Well, on the flight back from holiday I had the opportunity to watch it and decided to give it a go.

I've heard a few things about the new film and they can all be summarised as saying that it was a more faithful telling of the story than the John Wayne version. After watching both, and reading the book, I have to wonder if those reviewers knew WTF they were on about! Yes, the new film is good, but it's nowhere near as good as the original. And as for faithfulness to the book? Well, with the exception of the ending, the original film is far closer to the book (typically word for word). While watching the remake I kept asking myself time and again why they had changed this or that, or why they had completely rewritten the story in parts!

If you have a great novel that you know works well on screen, why do script writers seem incapable of leaving it untouched? Maybe they decided that they had to make the new film different enough from the original so people wouldn't think it was a scene-for-scene copy. But in that case, why remake it in the first place? FFS people: if an original film is good enough, leave it alone and learn to produce some original content for once! And for those of you interested in seeing the definitive film adaptation of the book, check out the John Wayne version.