User-Kernel Interactions

The DLM currently has pieces in user-space and in the kernel. The guts are all in the kernel, but the user-side is not something to laugh off:

Dynamically-loaded code

The current user-side code uses dynamic loading for a few different reasons:

Linux changes

Proposed changes to the DLM for the Linux release regarding dynamic loading:

Connections: user-side <-> kernel-side

This describes the many and varied ways that the user-side and kernel-side trade information.

Linux Changes

The changes in this area will be widespread, and reach a bit into the Cluster Manager realm. We'll talk connections here, and Cluster Manager details later.

The overall change here is to eliminate the multiple paths currently used between the user and kernel sides, and to unify all of it under a device model. This is shamelessly lifted from the Sequent Lock Manager model, and is in keeping with similar functions elsewhere in the Linux world. To wit:
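
As a sketch of the intended layout (the /dev/TheOneTrueDLM/ name is a placeholder; see "Actual name TBD" in the work plan below):

    /dev/TheOneTrueDLM/locks   opened by the user-side API library (and, in
                               principle, by kernel-side clients); carries
                               lock requests and responses
    /dev/TheOneTrueDLM/admin   opened by cllockd.u; carries cluster
                               configuration/status data down, and doubles
                               as the liveness check on cllockd.u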

User-side API Library

Overall, then, by eliminating the "which HACMP" issue and the need for linkage protection around the system calls, we can get rid of all dynamic loading in the user-side API library. The library can determine whether the DLM is configured by testing for the /dev/TheOneTrueDLM/locks device; if it is not present, the library returns an error. This can be done with normal system facilities and thus requires no fancy linking. If the DLM is configured but the driver (cllockd.x) is not loaded, we can either have the kernel auto-load cllockd.x or return an error.
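
A minimal sketch of that check, assuming the device path and the (made-up) return-code convention shown here, none of which is final:

    #include <errno.h>
    #include <fcntl.h>

    /* Hypothetical helper for the API library: distinguish "not configured"
     * from "configured but driver not loaded/usable", per the text above. */
    static int clm_open_device(void)
    {
        int fd = open("/dev/TheOneTrueDLM/locks", O_RDWR);

        if (fd >= 0)
            return fd;        /* DLM present and usable */
        if (errno == ENOENT)
            return -1;        /* device node absent: DLM not configured */
        return -2;            /* cllockd.x not loaded (unless the kernel
                               * auto-loads it on open) or otherwise unusable */
    }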

Given the layering of the API library, we should be able to mostly just "slide in" these changes by hacking up only the lower layers:

The user-side library works by having the locking APIs call a smaller set of interfaces which "package" the requests (struct transaction) for the trip through the system calls. Thus, the various "lock" calls (clmlock, clmlockx, clmlock_sync) all go through clm_send(). The upper-most APIs essentially just check their data for validity (where they can) and then pass it down.
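
Illustratively (the real struct transaction layout and the real clm_send() signature are not reproduced here), the lower layer reduces to something like:

    #include <unistd.h>

    /* Illustrative only; the real struct transaction carries the full set
     * of lock parameters (mode, flags, resource name, LKSB, and so on). */
    struct transaction {
        int  clm_type;          /* which API call this represents        */
        int  clm_mode;          /* requested lock mode                   */
        int  clm_flags;         /* caller's flags                        */
        char clm_resource[64];  /* resource name, sized arbitrarily here */
    };

    /* clm_send(): push the packaged request through the single
     * device-based path rather than the old system calls. */
    static int clm_send(int dlm_fd, const struct transaction *tr)
    {
        ssize_t n = write(dlm_fd, tr, sizeof(*tr));

        return (n == (ssize_t)sizeof(*tr)) ? 0 : -1;
    }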

Kernel-side API

Technically, it isn't a library we provide, but simply a set of function calls in the complete DLM module package that get loaded when cllockd.x gets loaded. Note that it is possible for kernel-side clients to go through /dev/TheOneTrueDLM/locks, but that is probably overkill. Rather, they should verify their data and then call the set of locking interfaces, which call the appropriate cllockd.x functions directly.

As above, specify via the "flag" that this is a kernel-side client.
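
For illustration only (the interface name, the flag value, and the calling convention below are placeholders, not the actual kernel-side API), a kernel-side client would look roughly like:

    #include <linux/errno.h>
    #include <linux/module.h>

    /* Placeholder names; not the real exported symbols or flag values. */
    #define DLM_KERNEL_CLIENT 0x01   /* the "flag" marking an in-kernel caller */

    extern int dlm_lock_internal(int mode, void *lksb, int flags,
                                 const char *resource, unsigned int namelen);

    static int example_take_lock(void *lksb)
    {
        if (!lksb)
            return -EINVAL;          /* the client verifies its own data */

        /* Calls straight into cllockd.x; no trip through the device. */
        return dlm_lock_internal(0 /* mode */, lksb, DLM_KERNEL_CLIENT,
                                 "example-resource", 16);
    }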

When a kernel-side client (module) that has been built against our API is loaded, it will normally force the DLM module to likewise be loaded. See "Building the DLM into the kernel" below for considerations when the DLM is built into the kernel.

cllockd.u <-> cllockd.x

We basically view cllockd.u as changing its role: rather than simply being an agent that loads the cllockd.x module, it will also assume various of the roles of the HACMP Cluster Manager:

We will get rid of both the socket and the message queue currently used by the HACMP Cluster Manager. cllockd.u will use the /dev/TheOneTrueDLM/admin device to send data to cllockd.x, opening it and using writes or ioctls to pass the data in. Alternatively, we could use /proc interfaces, but it strikes me that I'd like to keep /proc read-only and use the /dev/TheOneTrueDLM/ interface as the only "write" path down.
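
A rough cllockd.u-side sketch, with a made-up message layout and ioctl number (the actual encoding of what goes down this path is still to be settled):

    #include <sys/ioctl.h>

    /* Made-up message and request number, purely for illustration. */
    struct dlm_admin_msg {
        int  node_id;
        int  node_up;        /* 1 = up, 0 = down   */
        char ip_addr[16];    /* dotted-quad string */
    };

    #define DLM_ADMIN_SET_NODE _IOW('D', 1, struct dlm_admin_msg)

    static int push_node_status(int admin_fd, const struct dlm_admin_msg *msg)
    {
        /* Either a plain write() or an ioctl() works; both go through the
         * one /dev/TheOneTrueDLM/admin "write" path. */
        return ioctl(admin_fd, DLM_ADMIN_SET_NODE, msg);
    }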

Since cllockd.u will hold /dev/TheOneTrueDLM/admin open, cllockd.x can monitor it for liveness: if the device gets closed, we know that cllockd.u has died. This replaces the deadman function of the current socket. Death of the Cluster Manager currently causes the DLM to kill all clients and to shut itself down; it will be similar with cllockd.u, as losing it means losing our connection to the cluster infrastructure.
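
On the cllockd.x side this is just the admin device's release handler; a sketch, with a placeholder for whatever the existing shutdown path ends up being called:

    #include <linux/fs.h>
    #include <linux/module.h>

    /* Placeholder: the existing "Cluster Manager died" behavior (kill all
     * clients, shut the DLM down) would be invoked from here. */
    extern void dlm_cluster_manager_died(void);

    static int dlm_admin_release(struct inode *inode, struct file *file)
    {
        /* The only opener of /dev/TheOneTrueDLM/admin is cllockd.u, so any
         * close (including the implicit close when the process dies) means
         * the cluster connection is gone: the new deadman check. */
        dlm_cluster_manager_died();
        return 0;
    }

    static const struct file_operations dlm_admin_fops = {
        .owner   = THIS_MODULE,
        .release = dlm_admin_release,
    };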

In-kernel cluster infrastructure

It should also be reasonable to support interactions with in-kernel cluster infrastructure, e.g., the Sequent cluster code. In this case the cllockd.u code would read the config, pass it to the cllockd.x module, and then tell it to "talk to the kernel." At this point, we could have cllockd.u exit, or stay running.

The bigger piece is to provide interface module code similar to that used by cllockd.u. The cllockd.x code would have to force the load of an appropriate interface kernel module to interact with the in-kernel cluster infrastructure code. Nothing impossible, just not work we'll do in the immediate term.

The key point is that this in-kernel interface module can essentially call the same routines used for info coming from user-space, but the data won't come through the device (or it could, if we so choose).
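
Schematically, with placeholder names, both paths land on the same internal routine:

    /* Placeholder for the internal entry point inside cllockd.x. */
    extern int dlm_node_event(unsigned int node_id, int up);

    /* Path 1: the admin device's write()/ioctl() handler, fed by cllockd.u. */
    static int dlm_admin_input(unsigned int node_id, int up)
    {
        return dlm_node_event(node_id, up);
    }

    /* Path 2: a callback the interface module registers with the in-kernel
     * cluster infrastructure; no device involved. */
    static int dlm_infra_callback(unsigned int node_id, int up)
    {
        return dlm_node_event(node_id, up);
    }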

Building the DLM into the kernel

We don't force anyone to rebuild their kernel (the DLM can be dynamically loaded), nor do we force them to build it into the kernel; still, some installations may want to build it in just to avoid hassles with dynamic loading. We should probably assume that we'll eventually support this. Supporting it would, in particular, allow kernel-side clients that are themselves built into the kernel.

The primary issue is that various code-structural matters must be addressed to account for slight differences between the two environments (dynamic module vs. built-in). In addition, code to call the DLM cllockd.x initialization function needs to be added to the kernel.
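
One way to keep a single entry point for both environments, sketched against the module_init() convention (whether the existing code already fits this shape is another matter):

    #include <linux/init.h>
    #include <linux/module.h>

    /* Placeholder for the real cllockd.x initialization work. */
    static int __init cllockd_init(void)
    {
        /* register the /dev/TheOneTrueDLM/ devices, set up internal state... */
        return 0;
    }

    static void __exit cllockd_exit(void)
    {
        /* tear down; never runs in the built-in case */
    }

    /* module_init() becomes the module entry point when built as cllockd.x
     * and an ordinary initcall when built into the kernel, so the same
     * initialization function serves both environments. */
    module_init(cllockd_init);
    module_exit(cllockd_exit);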

However, for the built-in DLM to actually "function", we still need to form a cluster:

Cluster Manager

We use the term Cluster Manager to refer loosely to the "cluster infrastructure" on which the DLM depends for cluster configuration information. The DLM requirements here are relatively simple:

  1. Static info: the list of nodes in the cluster, i.e., what defines the "boundaries" of the cluster:
  2. Dynamic info: the status of each node and its IP address:
    1. When a node goes down, or when a node comes up, inform the DLM.
    2. When a node's IP address changes, inform the DLM.

"Static" above should therefore be viewed loosely, the DLM currently deals with a maximum of eight (8) nodes, but doesn't really care how many are around otherwise. Also, the node identifier space can be sparse. All of this really means is that the DLM will deal with whatever nodes "show up" so long as they have a DLM active on the node for us to work with.

For the DLM, a node being "up" essentially means that the IP address we've been given is active, and that we can expect to send messages to that address and receive them from it. Thus, so long as we are told a node is up, we will expect it to participate in the DLM protocols. If the node (or the given IP address) is actually dead, the DLM will "hang."

This states simply what the Cluster Manager must do for us: determine the set of active nodes and IP addresses, maintain the mapping between nodes and IP addresses, and feed changes in this information to the DLM.
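
Put concretely (types and names here are illustrative only), the information flowing down amounts to:

    #include <netinet/in.h>

    /* Per-node data the cluster infrastructure must maintain for the DLM. */
    struct dlm_node_info {
        unsigned int   node_id;   /* identifier; the space may be sparse   */
        struct in_addr ip_addr;   /* address the DLM exchanges messages on */
        int            is_up;     /* up/down as reported, not as observed  */
    };

    /* Events the Cluster Manager must deliver, in some form, to the DLM. */
    enum dlm_node_change {
        DLM_NODE_UP,          /* node came up              */
        DLM_NODE_DOWN,        /* node went down            */
        DLM_NODE_NEW_ADDR,    /* node's IP address changed */
    };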

Current Cluster Manager Usage

The DLM is currently dependent upon HACMP to feed this information to cllockd.x. HACMP in fact does the following:

Note that we really don't have to touch much of this function in the DLM for the Linux port. The problem is not that the core DLM code cannot simply continue operating exactly as it does today; it can.

So, what then is the problem?

Earlier we described the 'mechanics' of passing data between the Cluster Manager and cllockd.x. Thus, it's easy to abstract the collection and passing of node and interface status into cllockd.u, and we can even handle interfacing with various cluster infrastructures rather cleanly through dynamic loading.

The problem is the cluster infrastructure.

Linux Changes

Configuration File

At this point we basically assume we'll need a configuration file of our very own. Currently HACMP has the configuration, and the bits needed by the DLM are predigested and sent in. Looking at the cluster infrastructure discussion above, we need:

We will have cllockd.u read and digest this file. It will be used to interpret the data received from the cluster infrastructure and pass on the appropriate information to cllockd.x and libcccp. Ian has already downloaded open source code that can parse such a file for us. We leave it to the admin to keep the file consistent on all nodes in the cluster.
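
Purely as an illustration of the information the file has to carry (the actual syntax will be whatever the parsing code wants; none of this is decided):

    /* Hypothetical config entries only; the real syntax is undecided:
     *
     *   node 1  name=nodeA  ip=10.0.0.1
     *   node 2  name=nodeB  ip=10.0.0.2
     *   node 5  name=nodeE  ip=10.0.0.5    (identifiers may be sparse)
     */

    /* What cllockd.u might carry around after digesting one entry. */
    struct dlm_config_node {
        unsigned int node_id;
        char         name[64];
        char         ip_addr[16];   /* dotted-quad, handed to libcccp */
    };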

Cluster Infrastructure

I'll go look again at FailSafe. Ian will use his hardcoded config file glue to start playing with libcccp. Ian will think a bit on libcccp changes.

I'll lift existing code to talk to RSCT; that will allow us to run on clusters of more than two nodes, as well as clusters that aren't on a single (broadcast-capable) subnet. I'll also look at integrating the Heartbeat code into cllockd.u while I'm there.

Stuff Not Yet Covered Here

You may notice I've said nothing yet about some things:

Basically, we'll deal with these as we come to them.

Overall Work Plan

I plan to start immediately attacking the cllockd.u and cllockd.x code that manages the "device" (/dev/TheOneTrueDLM/) control. Actual name TBD. For now I'll interface with RSCT since the current cluster we have is not on a single subnet.

Next is to attack the API code enough to allow it to use the device interface and to allow us to have a client to see if we can actually lock anything...

Ian will continue on libcccp, explore modifying it to do liveness determination, and as that comes together look at the kernel-side API.

All of the above subject to change, of course.