The remainder of this document is divided into two major sections.
Section 2 details the current implementation of the recovery state
machines and describes the problems with that implementation. Section 3
describes the proposed reimplementation of the state machines.
2. The Current Implementation
The current implementation uses two tightly coupled state machines. In the
remainder of this document, the state machine implemented in clm_main.c will
be referred to as the CLM machine, and the state machine implemented
in clm_reconfig.c will be referred to as the RC machine. These names
were chosen based on the prefixes used for the names of the states in each
machine. That is, all state names in the CLM machine begin with
CLM_, and all state names in the RC machine begin with RC_.
2.1 The CLM Machine
The CLM machine, in terms of the number of states, is far smaller than the
RC machine. There are only 8 states, 3 of which are only used during
lock-manager startup and initialization. Roughly, the states are:
Each time any sort of input is processed by the main loop in clm_master_loop (in clm_main.c) the CLM machine gets "executed." For the most part, the input to the state machine will just be a replay of the previous input. To make progress, the machine relies on changes to global data made by other input sources (such as the call-back routines registered with CCCP).
Under AIX this technique worked quite well because CCCP was polled for
input. For the Linux port, CCCP executes as a separate daemon. The net
result is that the main loop does not automatically get executed each time a
CCCP message is delivered. This poses a significant problem, as has been
discovered by several of the developers on the DLM project. Since the main
loop does not "wake up" when a CCCP message is processed, it can never
respond to that input, and the CLM machine can never make progress.
2.1.2 Interaction with the RC Machine
Very little work is done in the states of the CLM machine. For virtually
all states, the input is simply passed into the RC machine. This can be
seen in the repeated calls to clmr_node_reconf throughout the code. Not
only that, but the return value of clmr_node_reconf is used to determine
when state transitions should happen in the CLM machine.
2.2 The RC Machine
The RC machine has many more states than the CLM machine, 18 in all. The
states break down into about the same general groups as the states of the
CLM machine:
It is very important to notice (and not trivially obvious at first glance) that each state of the RC machine can only occur during certain states of the CLM machine. For example, the RC machine can only be in the RC_PURGEWAIT state when the CLM machine is in the CLM_ST_DRAIN state. In fact, with the exception of the RC_IDLE state, each state of the RC machine can only be active while one particular state of the CLM machine is active.
When the two state machines are redrawn (NOTE: figures are coming soon), it becomes obvious that the states of the RC machine are actually substates of the CLM machine.
To further reinforce this notion, when certain state transitions occur in
the RC machine (such as the RC_PURGEWAIT to RC_CLEARLOCAL transition), a
special return code is sent back to the CLM machine to cause a transition in
that machine.
2.2.2 Use of Global Data
One thing that makes the operation of the RC machine non-obvious is its use
of global data. There are numerous places where a message is received from
CCCP, a global data structure is updated, and the RC machine is kicked via
the CLM machine. By examining the global data, without any explicit input,
the machine makes progress.
For example, several of the states, such as RC_DRAINPOINT and RC_SYNC_3,
implement a barrier mechanism (in the code this is called a "sync point").
When a barrier message is received, clm_inband_sync (clm_msg.c, line 477) is
invoked to update a global table. The CLM machine will then invoke the RC
machine, which will check the state of the table by calling syncd_up
(clm_msg.c, line 321).
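The general shape of this mechanism is sketched below. The sketch is purely illustrative and makes several assumptions: the table, counter, and function names are hypothetical, and the real logic lives in clm_inband_sync and syncd_up in clm_msg.c.

/* Purely illustrative sketch of the old "sync point" pattern; the real
 * code is clm_inband_sync() and syncd_up() in clm_msg.c. */
#define EXAMPLE_MAX_NODES 32               /* hypothetical limit */

static int sync_table[EXAMPLE_MAX_NODES];  /* global table updated by the callback */
static int expected_nodes;                 /* nodes expected at this sync point */

/* Called when a barrier ("sync point") message arrives from a node. */
static void example_inband_sync(int nodeid)
{
    sync_table[nodeid] = 1;
    /* Under AIX the polled main loop would eventually notice this change;
     * under Linux nothing wakes the main loop, which is the problem
     * described in section 2.1. */
}

/* Called from the RC machine to test whether the barrier is complete. */
static int example_syncd_up(void)
{
    int i;

    for (i = 0; i < expected_nodes; i++)
        if (!sync_table[i])
            return 0;                      /* still waiting on a node */
    return 1;                              /* every node has checked in */
}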
3. The New Implementation
The design of the new implementation centers around replacing the two
existing state machines with a single hierarchical state machine (HSM
hereafter). The HSM is implemented using a fairly generic datatype (i.e.,
one that could be used by other state machines in the DLM or elsewhere in
the kernel).
Additionally, instead of having a single entry point to the state
machine, input will be sent into the HSM at each place where it is generated.
This eliminates the vast majority of the global data queries used
by the old state machine.
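As an illustration of this idea, a CCCP message callback would build an event and hand it directly to the state machine. The sketch below is only an illustration: the event code, the hsm_msg_t fields, and the callback itself are assumptions, and hsm_process_event is part of the interface described in section 3.1.

/* Illustrative sketch only: CLM_EVT_NODE_DOWN, the hsm_msg_t fields, and
 * this callback are assumptions; hsm_process_event() is described below. */
#include "hsm.h"              /* hsm_t, hsm_msg_t, hsm_process_event() */
#include "clm_events.h"       /* hypothetical header defining CLM_EVT_* codes */

extern hsm_t clm_hsm;         /* the single recovery state machine */

/* A CCCP callback no longer updates global data and waits for the main
 * loop to notice; it feeds the input straight into the HSM. */
static void example_node_down_callback(int nodeid)
{
    hsm_msg_t msg;

    msg.evt  = CLM_EVT_NODE_DOWN;   /* assumed event code */
    msg.node = nodeid;              /* assumed payload field */

    hsm_process_event(&clm_hsm, &msg);
}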
3.1 Implementation of the Hierarchical State Machine
The generic HSM implementation is based on [1]. There
are a number of minor and a few major changes from the original
implementation.
void hsm_start(hsm_t * me);
void hsm_process_event(hsm_t * me, hsm_msg_t const * msg);

void STATE_START( hsm_t * me, hsm_state_t * target );
void STATE_TRAN( hsm_t * me, hsm_state_t * target );
hsm_state_t * STATE_CURR( hsm_t * me );

Figure 1. Interface functions to the HSM datatype
To send input to the state machine, hsm_process_event is called. In [1], the equivalent function processes the event directly by passing it to the handler for the active state. This poses two problems for the DLM. First, the handler functions for states can generate additional input to the state machine, which could result in hsm_process_event being called recursively. This could cause a variety of problems and incorrect behaviors.
More importantly, the DLM is a multithreaded piece of software. Since input is to be fed into the state machine from a number of sources, entry to hsm_process_event must be protected by a mutex at the very least. On Linux, there are basically two ways to do this. A data structure can either be protected by a spin-lock or a semaphore. Both of these options have problems. On a uniprocessor system, the spin-lock is essentially a no-op, and would be the best choice. However, on an SMP system, a process could spin for quite a long time (perhaps many milliseconds), and overall system performance would be degraded. For an SMP system, a semaphore would be a better choice since it would allow the process to give up the CPU. The semaphore would result in degraded performance on UP systems.
The better choice is not to use either one on its own. Mutual exclusion in the HSM datatype is achieved through three fields in the type. The event_queue is a FIFO queue of events to be processed by the HSM. The queue is protected by the event_mutex, which is a simple spin-lock. The event_delivery_flag is set whenever an event is being processed by the HSM. Figure 3 shows the pseudocode for the top level event processing function.
struct hsm {
    /* Pointer to static name of the state machine. */
    char const * name;

    /* Current state */
    hsm_state_t * curr;

    /* If non-NULL, this is the next state. */
    hsm_state_t * next;

    /* Top-most state object */
    hsm_state_t top;

    /* The mutex protects the queue and the flag.  Events are added to the
     * tail of the queue, and removed from the head.  If the flag is set,
     * then another process is already processing an event. */
    spinlock_t event_mutex;
    dlm_list_t event_queue;
    atomic_t event_delivery_flag;
};

Figure 2. The HSM datatype
lock( hsm->event_mutex )
add_to_queue( hsm->event_queue, new_event )
can_deliver = atomic_test_and_set( hsm->event_delivery_flag )
unlock( hsm->event_mutex )

if ( can_deliver )
    forever
        lock( hsm->event_mutex )
        event = dequeue( hsm->event_queue )
        if ( event == NIL )
            atomic_clear( hsm->event_delivery_flag )
            unlock( hsm->event_mutex )
            return
        endif
        unlock( hsm->event_mutex )

        deliver_event( hsm, event )
    done
endif

Figure 3. Event delivery pseudocode
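In C, the pseudocode of Figure 3 maps onto the fields of Figure 2 roughly as follows. This is a minimal sketch, not the final code: it substitutes the stock list_head for dlm_list_t, reads and sets the delivery flag under the spin-lock rather than using a true atomic test-and-set, and leaves deliver_event as an external stub.

/* Sketch of the event delivery loop from Figure 3, under the assumptions
 * noted above. */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>

struct example_event {
    struct list_head list;            /* linkage on the event queue */
    /* event payload would follow */
};

struct example_hsm {
    spinlock_t       event_mutex;
    struct list_head event_queue;
    atomic_t         event_delivery_flag;
};

extern void deliver_event(struct example_hsm *hsm, struct example_event *ev);

void example_process_event(struct example_hsm *hsm, struct example_event *new_event)
{
    struct example_event *ev;
    int can_deliver;

    spin_lock(&hsm->event_mutex);
    list_add_tail(&new_event->list, &hsm->event_queue);
    /* The flag is only touched while event_mutex is held, so the
     * test-and-set can be split into a read followed by a set. */
    can_deliver = (atomic_read(&hsm->event_delivery_flag) == 0);
    atomic_set(&hsm->event_delivery_flag, 1);
    spin_unlock(&hsm->event_mutex);

    if (!can_deliver)
        return;                       /* another process is already delivering */

    for (;;) {
        spin_lock(&hsm->event_mutex);
        if (list_empty(&hsm->event_queue)) {
            atomic_set(&hsm->event_delivery_flag, 0);
            spin_unlock(&hsm->event_mutex);
            return;
        }
        ev = list_entry(hsm->event_queue.next, struct example_event, list);
        list_del(&ev->list);
        spin_unlock(&hsm->event_mutex);

        deliver_event(hsm, ev);       /* may queue further events */
    }
}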
There is a drawback to this approach. In a state machine that is
receiving a large volume of input, it is possible for one process to get
stuck in the processing loop for an arbitrarily large amount of time. For
the specific cases where this datatype will be used in the DLM, this is not
an issue, but it might be for other users of the datatype. Since the kernel
is not preemptive, it would probably require a fairly large SMP machine
(i.e., more than 4 processors) for a process to really get stuck in the
processing loop. If that ever became a problem, it would probably be better
to change the implementation so that a single dedicated thread processes all
input to the HSM.
3.2 The Modified State Machine
The dual state machine architecture of the original DLM implementation is
replaced with a hierarchical state machine. The 8 states of the CLM machine
are replaced by 5 "top level" states in the HSM. Three of the top level
states have substates. The 13 substates roughly correspond to the 18 states
of the RC machine.
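As a sketch of how this hierarchy might be expressed with the HSM datatype, consider the fragment below. The hsm_state_ctor initializer, the handler signature, and the placeholder handler are assumptions made for illustration; only hsm_t, hsm_state_t, hsm_msg_t, hsm_start, and STATE_START come from the interface in Figure 1.

/* Sketch only: hsm_state_ctor() and the handler signature are assumed. */
#include "hsm.h"

static hsm_t clm_hsm;

/* Placeholder handler: returning the message means "not handled here, let
 * the parent state see it" (assumed convention). */
static hsm_msg_t const *example_handler(hsm_t *me, hsm_msg_t const *msg)
{
    return msg;
}

/* Two of the top-level states (former CLM machine)... */
static hsm_state_t clm_st_init, clm_st_drain;

/* ...and a few of their substates (former RC machine). */
static hsm_state_t rc_init_wait, rc_dir_wait;     /* under CLM_ST_INIT  */
static hsm_state_t rc_barrier1,  rc_purge_wait;   /* under CLM_ST_DRAIN */

static void example_build_hsm(void)
{
    /* hsm_state_ctor(state, name, handler, parent) is hypothetical. */
    hsm_state_ctor(&clm_st_init,  "CLM_ST_INIT",  example_handler, &clm_hsm.top);
    hsm_state_ctor(&clm_st_drain, "CLM_ST_DRAIN", example_handler, &clm_hsm.top);

    hsm_state_ctor(&rc_init_wait,  "RC_INIT_WAIT",  example_handler, &clm_st_init);
    hsm_state_ctor(&rc_dir_wait,   "RC_DIR_WAIT",   example_handler, &clm_st_init);
    hsm_state_ctor(&rc_barrier1,   "RC_BARRIER1",   example_handler, &clm_st_drain);
    hsm_state_ctor(&rc_purge_wait, "RC_PURGE_WAIT", example_handler, &clm_st_drain);

    /* The DLM starts in CLM_ST_INIT/RC_INIT_WAIT. */
    STATE_START(&clm_hsm, &rc_init_wait);
    hsm_start(&clm_hsm);
}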
3.2.1 CLM_ST_INIT
At startup, the DLM is in the CLM_ST_INIT state. Specifically, the
CLM_ST_INIT/RC_INIT_WAIT state will be active. As with the original DLM
implementation, the DLM will wait in this initial state for a
CLM_DO_NODE_INIT message from the user-mode daemon. When this message is
processed by the state machine, either the RC_DIR_WAIT or CLM_ST_RUN state
will become active. As in the original DLM implementation, this decision is
based on whether or not the node is alone in the cluster.
This is a fine point of the DLM that needs to be explicitly explained. When a node starts up, the DLM expects to receive directory information from some node already in the cluster. If the node is alone in the cluster, it will decide that there is no state available, and will move on. This is the decision made here when the CLM_DO_NODE_INIT message is processed. The DLM therefore has very specific expectations of the cluster manager when multiple nodes are booted to form the first cluster. The expectation is that the cluster manager will only allow one node to join the cluster at a time. This guarantees that one of the nodes will see that it is alone in the (initial) cluster and will be able to make progress.
On the transition into the CLM_ST_INIT/RC_DIR_WAIT state, the DLM will request directory information from some other node in the cluster. The algorithm used to select that node is not significant in this context.
Once in the CLM_ST_INIT/RC_DIR_WAIT state, the state machine will process
messages containing directory information and notifications that nodes have
left the cluster. The machine will remain in this state until it either
receives all of the directory information or all of the other nodes have
left the cluster. In either case, the state machine will then transition to CLM_ST_RUN.
3.2.2 CLM_ST_RUN
The CLM_ST_RUN state is the normal, steady state of the DLM. Once this
state becomes active, the DLM will remain in it until there is some
change in the state of the cluster. From the point of view of the state
machine there are three classes of cluster state change, and each determines
the action that will be taken and the next state of the HSM, as shown in the
following table.
Message | Action | Next State
---|---|---
NODE_DOWN | Run clmr_state_0 to start recovery | CLM_ST_DRAIN/RC_BARRIER1
NODE_{ADD,PURGE,DELETE} | None | CLM_ST_DRAIN/RC_BARRIER1
NODE_UP | Start sending directory information to the new node | CLM_ST_SENDDIR
The two most common messages are NODE_UP and NODE_DOWN. The remaining
messages are only used when a node is added to or removed from the cluster
configuration. The configuration is distinct from the cluster state, and it
is not currently changeable at runtime in the Linux implementation of the DLM.
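A sketch of the CLM_ST_RUN handler implementing the table above might look like the following. The handler signature, the CLM_EVT_* event codes, the msg fields, the assumed prototypes, and send_next_directory_block are assumptions; STATE_TRAN comes from Figure 1, and the state objects are assumed to be declared as in the earlier sketch.

/* Sketch only, under the assumptions noted above. */
#include "hsm.h"
#include "clm_events.h"     /* hypothetical header defining CLM_EVT_* codes */

extern hsm_state_t rc_barrier1, clm_st_senddir;
extern void clmr_state_0(void);                      /* assumed prototype */
extern void send_next_directory_block(int nodeid);   /* hypothetical helper */

static hsm_msg_t const *clm_run_handler(hsm_t *me, hsm_msg_t const *msg)
{
    switch (msg->evt) {
    case CLM_EVT_NODE_DOWN:
        clmr_state_0();                   /* start recovery */
        STATE_TRAN(me, &rc_barrier1);     /* -> CLM_ST_DRAIN/RC_BARRIER1 */
        return NULL;                      /* handled */

    case CLM_EVT_NODE_ADD:
    case CLM_EVT_NODE_PURGE:
    case CLM_EVT_NODE_DELETE:
        STATE_TRAN(me, &rc_barrier1);     /* -> CLM_ST_DRAIN/RC_BARRIER1 */
        return NULL;

    case CLM_EVT_NODE_UP:
        send_next_directory_block(msg->node);
        STATE_TRAN(me, &clm_st_senddir);  /* -> CLM_ST_SENDDIR */
        return NULL;
    }
    return msg;                           /* defer to the parent state */
}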
3.2.3 CLM_ST_SENDDIR
The CLM_ST_SENDDIR state is the co-state for CLM_ST_INIT/RC_DIR_WAIT. If
one node is in the CLM_ST_INIT/RC_DIR_WAIT state, the expectation is that
some other node will be in the CLM_ST_SENDDIR state. On entry to
CLM_ST_SENDDIR, the node will send a block of directory information to the
node that has just joined the cluster.
The node will then wait for an ACK from the node to which the directory information was sent. If there is more directory information to be sent, the node will stay in the CLM_ST_SENDDIR state and will send the next block of directory information. If there is no more directory information to be sent, the DLM will return to CLM_ST_RUN.
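This behavior can be sketched as follows; the event code and the directory helpers are hypothetical, but the control flow follows the description above.

/* Sketch only, under the assumptions noted above. */
extern hsm_state_t clm_st_run;
extern int  more_directory_blocks(void);              /* hypothetical helper */
extern void send_next_directory_block(int nodeid);    /* hypothetical helper */

static hsm_msg_t const *clm_senddir_handler(hsm_t *me, hsm_msg_t const *msg)
{
    if (msg->evt == CLM_EVT_DIR_ACK) {                 /* assumed event code */
        if (more_directory_blocks())
            send_next_directory_block(msg->node);      /* stay in CLM_ST_SENDDIR */
        else
            STATE_TRAN(me, &clm_st_run);               /* all blocks acknowledged */
        return NULL;
    }
    return msg;
}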
Note: At the time of this writing, I am not sure what
happens if a node leaves the cluster before all of the directory information
has been sent to it. I believe that the DLM will immediately
transition to the CLM_ST_DRAIN/RC_BARRIER1 state.
3.2.4 CLM_ST_DRAIN
The CLM_ST_DRAIN state is always entered with the RC_BARRIER1 substate.
This acts as a synchronization point so that no node moves past stage 0 of
recovery until every node has reached the barrier. Once the barrier has been
reached by all nodes, the CLM_ST_DRAIN/RC_PURGE_WAIT state will become
active.
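A barrier substate such as RC_BARRIER1 can be sketched as follows. The barrier counter, the CLM_EVT_BARRIER event code, and cluster_node_count are hypothetical; the point is only that the state waits for a barrier message from every node before moving on.

/* Sketch only, under the assumptions noted above. */
extern hsm_state_t rc_purge_wait;
extern int cluster_node_count(void);                   /* hypothetical helper */

static int barrier1_count;       /* nodes that have reached this barrier */

static hsm_msg_t const *rc_barrier1_handler(hsm_t *me, hsm_msg_t const *msg)
{
    if (msg->evt == CLM_EVT_BARRIER) {                  /* assumed event code */
        if (++barrier1_count == cluster_node_count()) {
            barrier1_count = 0;
            STATE_TRAN(me, &rc_purge_wait);             /* all nodes are here */
        }
        return NULL;
    }
    return msg;                  /* anything else bubbles up to CLM_ST_DRAIN */
}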
On entry to the RC_PURGE_WAIT state, a purge operation will occur. The
actual operation of the purge is not important for this context. When a
CLM_PURGE_DONE message is processed by the HSM, the machine will move to the
CLM_ST_RECONFIG/RC_BARRIER2 state. The CLM_PURGE_DONE message simply means
that the local node has finished purging.
3.2.5 CLM_ST_RECONFIG
The CLM_ST_RECONFIG state is the most complex and most significant of all
the superstates. It makes use of 9 substates. The first of these is always
the RC_BARRIER2 state.
The RC_BARRIER2 state is simply a synchronization point where the cluster will wait for all nodes to finish the purge operation. Once all nodes have reached the barrier, either the RC_PURGE_EXIT or the RC_DIR_CLEAN state will become active. The RC_PURGE_EXIT state becomes active if the reconfiguration operation was started by a NODE_PURGE message.
Upon entering the RC_PURGE_EXIT substate, the DLM will perform a purge_locks operation. The RC_PURGE_EXIT state then acts as another barrier. Once this barrier is complete, the DLM will perform a clmr_state_6 operation and return to the CLM_ST_RUN state.
The RC_DIR_CLEAN state is somewhat more complicated. On entry to this state, and each time an ACK is received, clmr_stage_2 is called. This operation removes all non-local entries from the local directory. Each time an entry is removed from the directory, a message is sent to the master node. An ACK is expected for each message.
Once all of the non-local directory entries have been removed and all of the ACKs have been received, the RC_BARRIER3 state will become active. Here the DLM will wait for all of the nodes to complete the RC_DIR_CLEAN state. Once the barrier has been reached, the RC_RESOURCE_QUERY state will become active.
The RC_RESOURCE_QUERY state is quite a bit like the RC_DIR_CLEAN state, except that clmr_stage_3 is performed instead of clmr_stage_2. In fact, clmr_stage_3 does pretty much the opposite of clmr_stage_2. Instead of clearing all of the non-local directory entries, the directory for each lock is queried to determine which node is the master of the lock. If the local node is not the master of the lock, all of the locally known information about the lock is sent to the master. Again, ACKs are expected for this information. When all of the locks have been processed and the last ACK has been received, the RC_BARRIER4 state becomes active.
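The incremental pattern shared by RC_DIR_CLEAN and RC_RESOURCE_QUERY can be sketched as follows: a batch of entries is processed on entry and on every ACK, and the state advances only once the work is exhausted and no ACKs remain outstanding. The event codes, the return-value convention assumed for clmr_stage_2, and directory_clean_done are all illustrative assumptions.

/* Sketch only, under the assumptions noted above. */
extern hsm_state_t rc_barrier3;
extern int clmr_stage_2(void);          /* assumed: returns messages sent this batch */
extern int directory_clean_done(void);  /* hypothetical helper */

static int outstanding_acks;

static hsm_msg_t const *rc_dir_clean_handler(hsm_t *me, hsm_msg_t const *msg)
{
    switch (msg->evt) {
    case CLM_EVT_STATE_ENTRY:           /* assumed state-entry event */
    case CLM_EVT_DIR_ACK:
        if (msg->evt == CLM_EVT_DIR_ACK)
            outstanding_acks--;
        /* Remove the next batch of non-local entries; each removal sends
         * one message to the master and therefore expects one more ACK. */
        outstanding_acks += clmr_stage_2();
        if (directory_clean_done() && outstanding_acks == 0)
            STATE_TRAN(me, &rc_barrier3);
        return NULL;
    }
    return msg;
}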
The DLM waits in the RC_BARRIER4 state for all nodes to complete the RC_RESOURCE_QUERY state. Once all nodes have reached the barrier, clmr_stage_4a is performed, and the RC_BARRIER5 state becomes active. Once all nodes have reached RC_BARRIER5, RC_WAITDIR (this was RC_WAITDIR3 in the original implementation) becomes active.
On entry to RC_WAITDIR, clmr_stage_4b is executed. According to the comments in the code, "this function removes directories temporarily created for local locks during reconfig." This is another function that is executed repeatedly, a few entries at a time. However, since this function does not send any messages or do a lot of processing per entry, it seems like all of the entries could be processed at once. That would allow this state to be merged with the next state, RC_BARRIER6.
Once all nodes have reached RC_BARRIER6, clmr_stage_6 will be executed
(there is no clmr_stage_5 or clmr_stage_1), and the DLM returns to
CLM_ST_RUN.
3.3 Open Issues
There are a few open issues that will need to be resolved at some point.
This list is currently "evolving" as the design is refined.
3.3.1 Barrier State Naming
In the original state machine implementation there were a number of states
that acted purely as synchronization points. By just looking at the code,
it was not always obvious which states were which. In the new
implementation, the term "sync point" has been replaced with barrier, and
all states that act as barriers are named as such.
The original implementation also used a table mechanism to determine
which states were barriers. In the new code, a barrier state explicitly
waits for a barrier message. Since this makes things much more explicit,
it might be better to name the barrier states differently. Specifically,
each barrier state has some function that is executed on entry to that
state. It might be better to name the state to reflect the work done in
that function.
3.3.2. RC_WAITDIR
As noted in the description of the
CLM_ST_RECONFIG/RC_WAITDIR state, it might be possible to merge this
state with the RC_BARRIER6 state. The impact that such a change would have
needs to be investigated.
4. References