Wednesday, December 20, 2017

Replication and Atomic Actions


Replication can be integrated into atomic actions in two ways: replicas within actions, and replicated actions. These two methods are substantially different from each other and yet attempt to solve the same problem: to ensure (with a high degree of probability) that an action can commit despite a finite number of component failures.

Replicas within Actions

Of the two methods of combining replication with atomic actions, this method is the most intuitive: each non-replicated object is replaced by a replica group. Whenever an action issues a request on the object this is translated into a request on the entire replica group (how the requests are interpreted depends upon the replica consistency protocol). Failures of replicas within a given replica group are handled by the replication protocol in an effort to mask them until a sufficient number of failures have occurred which makes masking impossible, and in this case the replica group fails. When an action comes to commit, the replication protocol dictates the minimum number of replicas which must be functioning in any given replica group for that group to be considered operational and allow the action to commit.


Problems which arise from this method have already been outlined in earlier entries. The replication protocol must be able to deal with multiple messages from both replicated client groups as well as replicated server groups; the majority of replication protocols assume that replicated objects within the same replica group are deterministic; problems can occur in maintaining consistency between replicas when local asynchronous events can occur such as RPC timeouts.


Replicated Actions 


The idea behind Replicated Actions [Ahamad 87][Ng 89] is to take an atomic action written for a non-replicated system and replicate the entire action to achieve availability. In this scheme the unit of replication is the action along with non-replicated objects used within it. In the Clouds distributed system [Ahamad 87] the replicated actions are called PETS (Parallel Execution Threads). By replicating the action N times (creating N PETS) and hence by also replicating the objects which the action uses, the probability of at least one action being able to commit successfully is increased. Although the actions and objects are replicated, each action only ever communicates with one replica from a given object-replica group and does not know about either the other replicas or the other actions until it comes to commit. Subsequently, if this replica fails, so too does the action (as is the case in a non-replicated system).

Because each action only uses one replica in a given group and each replica only receives messages from this one action, the replicas need not be deterministic, and there is no need to devise some protocol for dealing with multiple requests (or replies) from a client (server) group. When the replicated actions commit, it is necessary to ensure that only one action does so (to maintain consistency between the replicated objects). To do this, typically the first action to commit (the fastest) as part of its commit protocol copies the states of the objects it has used to those replicas it did not use (i.e. the replicas used by the other actions) and causes the other actions to abort. This is done atomically, so that if this atomic action should fail then the next replicated action which commits will proceed as though it had finished first.


The figure above shows the state of the objects A, B, C, and D which are replicated three
times, just as the replicated actions, which are also replicated three times, α, β, and ε, begin commit processing. The execution paths of these actions is indicated by the lines.
When actions α and β come to commit they will be unable to do so because the object D (which α used) and C (which β used) have failed, making commit impossible. However, action ε will be able to commit, and will then update the states of all remaining functioning replicas. Failed replicas must be updated before they can be reused.

Obviously the choice of which replicas the actions should use is important e.g. in
In the figure, if action ε had used the same copy of object D as action then the failure of this object would have caused the two actions to have failed instead of one. In some systems [Ahamad 87] the choice of which replica a particular action uses is made randomly because they make the assumption that with a sufficient number of object replicas the chances of two actions using the same copy are small. However, in [Ng 89] they propose a different approach by using information about the nature of the distributed system (reliability of nodes, probability of network partitions occurring) to come up with an optimal route for each action to take when using replicated objects. This route attempts to minimize the object overlap between replicated actions and minimize the number of different nodes that an action visits during the course of its execution.

The advantage of using replicated actions as opposed to using replicas within an action are the same advantages obtained from using a passive replication protocol as opposed to using an active replication protocol: there is no need to ensure that all copies of the same object are deterministic since the action which commits will impose its view of the state of the object on all other copies, and each replica will receive a reduced number of repeated requests from replicated clients (reduced to only one message if each action makes use of a different replica).
However, the scheme also suffers from the problems inherent with a passive replication scheme: checkpointing of state across a network can be a time consuming, expensive operation. If we assume failures to be rare (as would be the case) then this scheme will cause large numbers of actions to abort. The action which commits first will overwrite the states of the other replicas, effectively removing the knowledge that the other actions ever ran. In some applications it may prove more of an overhead to checkpoint the state of one action in this way rather than allowing all functioning actions simply to commit.

No comments: