User-Kernel Interactions
The DLM currently has pieces in user-space and in the
kernel. The guts are all in the kernel, but the user side is not to
be dismissed:
- User-side
- API library (libclm), called by clients.
- Cluster manager interface (HACMP), provides node status and IP
address information.
- Control daemon (cllockd.u, from cllockd/cllockd.c), loads kernel
code, catches signals, stops and unloads kernel code.
- Kernel-side
- DLM State-machine (cllockd.x, from cllockd/ext/Makefile),
the guts of the DLM that handles driving the locking
protocols. Loaded as a dynamic kernel extension (kernel module is the
Linux term, which we'll adopt for the rest of this document.)
- Clever Cluster Communications Package (libcccp), inter-node
DLM communications
- Cluster manager data interpretation (libclminfo), munges data from
the user-side Cluster Manager into a form digestible by libcccp
and cllockd.x. Although logically separate, it is actually code that
is contained in the cllockd directory and built into cllockd.x.
- New for Linux: introduce an in-kernel API library, to allow
simultaneous API usage from both user-space and kernel-space.
- In fact, an in-kernel API is of more interest to GFS anyway.
- Structurally, this can "mostly" just use the current calls that go
into the kernel, although some changes are needed since they expect
all clients to be in user space, so each request will need to carry a flag.
- New for Linux: use the /proc filesystem to export (and,
maybe even import) data from/to the DLM.
Dynamically-loaded code
The current user-side code uses dynamic loading for a few different
reasons:
- cllockd.u:
- Loads up code that then loads and initializes the kernel extension
(cllockd.x, libcccp).
- Dynamic because it has to work with different versions of HACMP
and each has different paths.
- API, libclm:
- Loads up API glue code.
- Dynamic because it has to work with different versions of HACMP
and each has different paths.
- Additionally, it needs to use dynamic loading to avoid possible
linkage problems if a DLM client starts up and the DLM
is not yet loaded and running.
Linux changes
Proposed changes to the DLM for the Linux release regarding
dynamic loading:
- Eliminate multiple-path support; we will keep all of the API glue
code and the kernel module (extension) code in one place and use
only that.
- We will also modify the methods the API code uses to reach the
kernel code (i.e., remove the system calls; see the
Connections section later), which will help eliminate the
worries about linkage problems with clients.
- However, we will keep one facet of dynamic loading in cllockd.u:
- We will use a configuration file to determine with "which"
Cluster Manager we will be interfacing, and dynamically load
the appropriate module.
- This eliminates the linkage worries we would have if we built in
support for all of the Cluster Managers to which we envision
connecting.
- See the Cluster Manager section.
Connections
This section describes the many and varied ways that the user side and
kernel side trade information.
- Cluster Manager <-> cllockd.x:
- UNIX-domain socket
- Acts as a 'deadman' to inform both parties if the other has died.
There is a dedicated kernel thread in cllockd.x whose death makes
things very uncomfortable in terms of proper function. Likewise,
losing the Cluster Manager is also bad.
- No data is passed in either direction.
- Message queue
- Cluster Manager uses this to pass the configuration information to cllockd.x.
- When loaded and on its first connection to the message queue,
cllockd.x sends its 'major+minor' numbers to the Cluster
Manager. These identify the release level of the DLM and
are used in various migration and functional scenarios by the
Cluster Manager.
- API library <-> cllockd.x:
- All API requests in AIX go through a set of dynamically-loaded
system calls. These system calls do not have to be in a static
table.
- Oops. Linux doesn't support dynamic system calls. I couldn't
find any support for this in the code, and trolling the Linux kernel
mailing list confirmed it.
- We don't want to require a modified kernel (even something as
"trivial" as modifying the system call table), so we need something
else. Besides, this isn't trivial, as it could clash with other
products (e.g., JFS) that go out and modify this table.
- Data from cllockd.x to the clients is passed by having the
client code call ASTpoll(), which uses one of the magic system calls
to retrieve ASTs for execution.
- The actual system calls themselves pass only "simple" arguments,
pointers and sizes, and cllockd.x uses copyin()/copyout() (Linux
equivalents copy_from_user()/copy_to_user()) to move the "real" data
between user and kernel space using the given pointers and sizes.
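As a minimal sketch of what the Linux side of that last point could
look like (the handler and request structure here are hypothetical,
not the actual cllockd.x routines):

    #include <linux/errno.h>
    #include <linux/types.h>
    #include <linux/uaccess.h>   /* copy_from_user(), copy_to_user() */

    /* Hypothetical packaged request; the real layout lives in the DLM headers. */
    struct dlm_request {
        int  opcode;
        char payload[256];
    };

    /* Copy a packaged request in from user space, act on it, and copy the
     * reply back out.  'ureq' is the user-space pointer the API library
     * handed us, along with its size. */
    static int dlm_handle_request(void __user *ureq, size_t size)
    {
        struct dlm_request req;

        if (size != sizeof(req))
            return -EINVAL;
        if (copy_from_user(&req, ureq, sizeof(req)))
            return -EFAULT;

        /* ... drive the locking code with 'req' here ... */

        if (copy_to_user(ureq, &req, sizeof(req)))
            return -EFAULT;
        return 0;
    }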
Linux Changes
The changes in this area will be widespread, and reach a bit into
the Cluster Manager realm. We'll talk
connections here, and Cluster Manager detail later.
The overall change here is to eliminate the multiple paths
currently used between the user and kernel side, and unify all of it
under a device model. This is shamelessly lifted from the
Sequent Lock Manager model, as well as in keeping with other similar
functions in the Linux world. To wit:
- Create a device entry, in /dev, e.g., /dev/TheOneTrueDLM/.
- Either assign a major number ourselves (when the module is loaded) or
let the kernel assign us one. In either case, make sure the device
entries match (see the sketch after this list).
- Beneath this, create "subdirectories", i.e., different minors:
- /dev/TheOneTrueDLM/admin - for the new Cluster Manager
connection.
- /dev/TheOneTrueDLM/locks - for the API path. Alternately we may
have different minors for different clients, e.g., GFS, OPS (that's a
joke son!), and SuperPowerClusterLockClient.
- Eliminate the socket and message queue used between the Cluster
Manager and cllockd.x.
- Pass data using the /dev/TheOneTrueDLM/admin device; we can also use
the device to signal cllockd.x problems to the user side (i.e., to
provide the deadman-like signal currently achieved by the socket).
- Eliminate the dynamic system calls.
- Currently the actual APIs called by clients are NOT the system
calls; those are layered five (yes, 5) layers deep in function calls.
- So, by modifying the layers somewhere, we can have the API go
through the /dev/TheOneTrueDLM/locks device for all of its pleasures
and the client code is none the wiser.
- This potentially allows us to introduce a modification where
we signal an "interrupt" on the device to a client, to inform it of
pending ASTs to execute; currently clients have to keep calling
ASTpoll() and polling, or make synchronous locking calls.
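Here is the device-registration sketch referred to above. It is a
minimal sketch only: the device name, the minor numbers, and the
function names are placeholders, and the real file_operations would be
filled out by the actual cllockd.x code.

    #include <linux/errno.h>
    #include <linux/fs.h>        /* register_chrdev(), file_operations, iminor() */
    #include <linux/module.h>

    #define DLM_MINOR_ADMIN 0    /* /dev/TheOneTrueDLM/admin */
    #define DLM_MINOR_LOCKS 1    /* /dev/TheOneTrueDLM/locks */

    static int dlm_major;        /* kernel-assigned major number */

    static int dlm_open(struct inode *inode, struct file *file)
    {
        /* Dispatch on the minor: admin path vs. API/lock path. */
        switch (iminor(inode)) {
        case DLM_MINOR_ADMIN:
        case DLM_MINOR_LOCKS:
            return 0;
        default:
            return -ENODEV;
        }
    }

    static const struct file_operations dlm_fops = {
        .owner = THIS_MODULE,
        .open  = dlm_open,
        /* read, write, ioctl, and release entries filled in by the real code */
    };

    static int __init dlm_init(void)
    {
        /* Passing 0 asks the kernel to assign a major for us; the /dev
         * entries must then be created to match whatever we got. */
        dlm_major = register_chrdev(0, "TheOneTrueDLM", &dlm_fops);
        return dlm_major < 0 ? dlm_major : 0;
    }

    static void __exit dlm_exit(void)
    {
        unregister_chrdev(dlm_major, "TheOneTrueDLM");
    }

    module_init(dlm_init);
    module_exit(dlm_exit);
    MODULE_LICENSE("GPL");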
User-side API Library
Overall then, by eliminating the "which HACMP" issue and the need
for linkage protection due to the system calls, we can get rid of all
dynamic loading in the user-side API library. In this way, the library
can determine whether the DLM is configured by testing for the
/dev/TheOneTrueDLM/locks device; if it is not present, an error is
returned. This can be done with normal system facilities, and thus
requires no fancy linking. If it is configured, but the driver
(cllockd.x) is not loaded, we can either have the kernel auto-load
cllockd.x or an error will be returned.
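A minimal user-side sketch of that device test, assuming the device
path proposed above (the error handling and messages are illustrative):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    /* Open the lock device; a missing device means the DLM is not
     * configured on this node. */
    static int dlm_open_locks(void)
    {
        int fd = open("/dev/TheOneTrueDLM/locks", O_RDWR);

        if (fd < 0) {
            if (errno == ENOENT)
                fprintf(stderr, "DLM: not configured on this node\n");
            else
                fprintf(stderr, "DLM: cannot open lock device: %s\n",
                        strerror(errno));
        }
        return fd;
    }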
Given the layering of the API library, we should be able to mostly
just "slide in" these changes, by hacking up only the lower layers:
- User-side: api_sendpkt(), clm_send(), api_recvpkt().
- Kernel-side: clm_send(), clm_recv()
- Add in a flag or other indication that specifies whether a client
is user-space or kernel-space; the user-space API library routines
simply set the flag on the way through the device.
The user-side library works by having the locking APIs call a
smaller set of interfaces which "package" (struct transaction) the
requests for going through the system calls. Thus, the
various "lock" calls (clmlock, clmlockx, clmlock_sync) all go through
clm_send(). The upper-most APIs essentially just parse their data for
validity (where they can) and then the data is passed down.
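To make the flag concrete, a rough sketch follows. The real struct
transaction layout comes from the existing libclm/cllockd headers;
only the 'origin' field is the proposed addition, and the other names
here are illustrative.

    /* Sketch only -- existing fields come from the current code. */
    enum dlm_client_origin {
        DLM_CLIENT_USER,     /* arrived via /dev/TheOneTrueDLM/locks */
        DLM_CLIENT_KERNEL    /* called directly by a kernel module   */
    };

    struct transaction {
        int                     clm_request;   /* which API call is packaged  */
        enum dlm_client_origin  origin;        /* new flag for the Linux port */
        /* ... existing request/response fields ... */
    };

The user-side clm_send() would set DLM_CLIENT_USER on the way through
the device; the kernel-side entry points would set DLM_CLIENT_KERNEL.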
Kernel-side API
Technically, it isn't a library we provide, simply a set of
function calls in the complete DLM module package that get
loaded when cllockd.x gets loaded. Note that it is possible for the
kernel-side clients to go through /dev/TheOneTrueDLM/locks, but that
is probably overkill. Rather, they should verify their data and then
call the set of locking interfaces, which call the
appropriate cllockd.x functions directly.
As above, specify via the "flag" that this is a kernel-side client.
When a kernel-side client (module) that has been built against our
API is loaded, it will normally force the DLM module to
likewise be loaded. See below for considerations on building the
DLM into the kernel.
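One way this forced load can fall out naturally is via module symbol
dependencies: if cllockd.x exports its kernel-side entry points, a
client module that calls them picks up a dependency, and
depmod/modprobe should then load the DLM module first. A sketch, with
a hypothetical name and signature:

    #include <linux/module.h>

    /* In cllockd.x: export the kernel-side locking entry points.  The
     * name and signature below are placeholders, not the real API. */
    int dlm_kernel_lock(void *lksb, int mode, int flags)
    {
        /* verify arguments, mark the request as kernel-origin, and call
         * the same internal cllockd.x routines the device path uses */
        return 0;
    }
    EXPORT_SYMBOL(dlm_kernel_lock);

A kernel-side client module that calls dlm_kernel_lock() then carries
a symbol dependency on the DLM module, so loading the client pulls the
DLM in as well.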
Control Daemon (cllockd.u)
We basically view cllockd.u as expanding its role: rather than simply
being an agent to load the cllockd.x module, it will assume various of
the roles of the HACMP Cluster Manager as well:
- Munge configuration data from our configuration file.
- Interact with the clustering infrastructure to get node and
interface liveness information.
- We anticipate cllockd.u needing to interface with a number of
different clustering packages.
- We'll use dynamic loading of interface code to allow us to load in
environments where one or more of these is missing.
- The configuration file will tell us "which one" and we'll load up
the appropriate interface module.
- The purpose of the interface modules is to take the data as
provided by each cluster package, and munge it into an internal form
that we can pass through to cllockd.x.
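A minimal sketch of that dynamic loading in cllockd.u, assuming each
interface module exports a single well-known entry point (the entry
point name and signature are illustrative):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Each interface module registers its callbacks for node and
     * interface status events from its init entry point. */
    typedef int (*cm_interface_init_t)(void);

    static int load_cm_interface(const char *path)
    {
        void *handle = dlopen(path, RTLD_NOW);
        cm_interface_init_t init;

        if (!handle) {
            fprintf(stderr, "cllockd.u: cannot load %s: %s\n", path, dlerror());
            return -1;
        }
        init = (cm_interface_init_t)dlsym(handle, "cm_interface_init");
        if (!init) {
            fprintf(stderr, "cllockd.u: %s: %s\n", path, dlerror());
            dlclose(handle);
            return -1;
        }
        return init();
    }

cllockd.u would call this with the module path named in the
configuration file, and link against libdl (-ldl).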
We will get rid of both the socket and message queue currently used
by the HACMP Cluster Manager. cllockd.u will use the
/dev/TheOneTrueDLM/admin device to send data to cllockd.x, by opening
it and using writes or ioctls to pass the data in. Alternately, we
can use the /proc interfaces, but it strikes me that I'd like to keep
/proc read-only, and use the /dev/TheOneTrueDLM/ interface as the
only "write" path down.
Since cllockd.u will open /dev/TheOneTrueDLM/admin, cllockd.x can
monitor it for liveness: if the device gets closed, then we know that
cllockd.u has snuffed it. This replaces the deadman function of the
current socket. Death of the Cluster Manager currently causes
the DLM to kill all clients, and to shut itself down. This
will be similar with cllockd.u, as loss of it means loss of our
connection to the cluster infrastructure.
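A rough sketch of the cllockd.x side of that deadman, hung off the
admin device's release() method; the shutdown routine here is
hypothetical:

    #include <linux/fs.h>
    #include <linux/kernel.h>

    /* Hypothetical internal routine: kill clients and shut the DLM
     * down, just as loss of the HACMP socket does today. */
    static void dlm_cluster_manager_died(void)
    {
        printk(KERN_ERR "DLM: cluster manager connection lost, shutting down\n");
        /* ... purge clients, stop the locking protocols ... */
    }

    /* Hooked up as .release on the admin minor: the kernel calls this
     * when the last reference to /dev/TheOneTrueDLM/admin is dropped,
     * i.e., when cllockd.u exits for any reason (exit, crash, or kill). */
    static int dlm_admin_release(struct inode *inode, struct file *file)
    {
        dlm_cluster_manager_died();
        return 0;
    }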
In-kernel cluster infrastructure
It should also be reasonable to support interactions with in-kernel
cluster infrastructure, e.g., the Sequent cluster code. In this case
the cllockd.u code would read the config, pass it to the cllockd.x
module, and then tell it to "talk to the kernel." At this point, we
could have cllockd.u exit, or stay running.
The bigger piece is to provide interface module code similar to
that used by cllockd.u. The cllockd.x code would have to force the
load of an appropriate interface kernel module to interact with the
in-kernel cluster infrastructure code. Nothing impossible, just not
work we'll do in the immediate term.
The key point is that this in-kernel interface module can essentially
call the same routines as are used for info coming from user space,
but it won't come through the device (or it could, if we so choose).
Building the DLM into the kernel
While we do not require anyone to rebuild their kernel (the
DLM can be dynamically loaded), nor force them to
actually build it into the kernel, some installations may want to do
so just to avoid hassles with dynamic loading. We should probably
assume that we'll eventually support this. This would also allow
kernel-side clients that are built into the kernel themselves.
The primary issue is that various code structural issues must be
addressed, to account for slight differences in the two environments
(dynamic module vs. built-in). In addition, code to call the
DLM cllockd.x initialization function needs to be added to the
kernel.
However, for the built-in DLM to actually "function", we must
actually form a cluster:
- The cllockd.u Cluster Manager must be running.
- Cluster heartbeating must be running.
Cluster Manager
We use the term Cluster Manager to refer loosely to the
"cluster infrastructure" on which the DLM depends for cluster
configuration information. The DLM requirements here are
relatively simple:
- Static info, the list of nodes in the cluster, i.e., this defines
the "boundaries" of the cluster:
- Each having a "simple" identifier, a 'node number' (a short.)
- The IP address that matches to each identified node.
- Dynamic info, the status of each node and IP address:
- When a node goes down, or when a node comes up, inform the
DLM.
- When a node's IP address changes, inform the DLM.
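In other words, the data that must flow from the Cluster Manager down
to cllockd.x is small; roughly something like this (field names are
illustrative):

    #include <netinet/in.h>

    /* Per-node record that cllockd.u hands down to cllockd.x/libcccp. */
    struct dlm_node_info {
        short          node_id;    /* integral node number, may be sparse     */
        struct in_addr addr;       /* IP address currently in use for the node */
        enum { DLM_NODE_DOWN, DLM_NODE_UP } state;
    };

    /* Dynamic updates are simply the same record re-sent whenever a node
     * changes state or moves to a different IP address. */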
"Static" above should therefore be viewed loosely, the DLM
currently deals with a maximum of eight (8) nodes, but doesn't really
care how many are around otherwise. Also, the node identifier space
can be sparse. All of this really means is that the DLM will
deal with whatever nodes "show up" so long as they have a DLM
active on the node for us to work with.
For the DLM, a node being "up" essentially means that
the IP address we've been given is active and we can expect to send
messages to that address and receive them from it. Thus, so long as
we are told a node is up, we will expect it to participate in the
DLM protocols. If the node (or the given IP address) is
actually dead, the DLM will "hang."
This then states simply what the Cluster Manager must do for
us. It must determine the set of active nodes and IP addresses, and
maintain a mapping among IP addresses and nodes and feed changes in
this information to the DLM.
Current Cluster Manager Usage
The DLM is currently dependent upon HACMP to feed this
information to cllockd.x. HACMP in fact does the following:
- Does heartbeating and all of the necessary protocol gorp to verify
the status of:
- Nodes
- Comm adapters
- IP addresses
- Using this dynamic information, it feeds to cllockd.x:
- Which nodes are defined in the cluster (identified by an integral
node number.)
- Which nodes are up and down (or have just transitioned.)
- Which IP address to use for a node at this point in time.
- The cllockd.x code uses the above information to drive its
communications and during its locking protocols.
- A node is considered up until the Cluster Manager informs
cllockd.x that a node is down. Thus, it must reply and participate in
all protocols.
- The IP address given for a node is used for libcccp's UDP packets,
again until the Cluster Manager informs cllockd.x that a
different IP address should be used.
- HACMP can also quiesce the DLM at certain times, such as
during upgrades or DARE (dynamic configuration changes) operations.
- The DLM during these periods simply queues up requests.
Note that we really don't have to touch much of this function
in the DLM for the Linux port. The problem is not here: we
can have the core DLM code simply continue
operating exactly as it does today.
So, what then is the problem?
Earlier we described the 'mechanics' of
passing data between the Cluster Manager and cllockd.x.
Thus, it's easy to abstract the collection and passing of node and
interface status into cllockd.u, and we can even handle interfacing
with various different cluster infrastructures rather cleanly through
dynamic loading.
The problem is the cluster infrastructure.
- Unlike with HACMP, we have no "industrial strength" infrastructure
to work with.
- RSCT (aka, Phoenix):
- This certainly is industrial strength, and in fact is what
supports HACMP itself. This also identifies each node via an integral
node id (node number), scales far past anywhere we care about (512),
and "publishes" IP addresses that match to communications interfaces.
So no worries: this would be more than sufficient for the
DLM. Plus the pieces we need, TS and GS, are packaged
separately with "simple-minded" config tools, making a smaller package
available. Good.
- But, although it has been under consideration for open source
release, there is no way it will be released when we need it
(Feb. 2001.) Bad.
- We cannot release as open source if we're dependent upon a
proprietary product.
- Heartbeat (Alan Robertson):
- This is open source, is available, runs on most distributions, and
has an API that we can use. Good.
- It is limited to two nodes, is dependent upon broadcast, uses
node names (uname result, not integral node ids), and uses interface
names (the API doesn't provide IP addresses). Bad.
- FailSafe (SGI + SuSE):
- This is open source, is available, scales to eight nodes, provides
integral node ids (if I remember correctly), is relatively
straightforward to configure (GUI and/or CLI) and has APIs we can
use. Good.
- It doesn't scale well to eight nodes; I don't remember if it
allows us to directly get IP addresses; and it is big and somewhat
ugly since it is a "full failover" product, which means a relatively
big item to load. Bad.
- Modify libcccp:
- libcccp has "most" of the functionality; add the extra bits to have
it determine whether a node is dead or alive.
- Right now, we only communicate once we get lock requests. We could
continue with "lazy" heartbeating, i.e., assume a node is up unless we
need to talk to it, then let the libcccp mechanisms try to get in
touch; if it doesn't respond 'in time', then it is dead, so drive
recovery and remaster the locks.
- Need an 'instance' number so that we can discover that a node
bounced when we weren't looking.
- 2 nodes with locks, sitting idle.
- One node dies, restarts.
- Second node later tries to do some locking operation, sends
message.
- Gets told that "I am the new me, know nothing of my past life."
- Oops, drive recovery.
- More work is necessary to figure out how expensive it would be to
modify the libcccp code (and cllockd.x as necessary); a rough sketch
of the instance-number idea follows this list.
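Here is that rough sketch of the instance-number idea; the message
header layout and names below are invented for illustration, not taken
from libcccp:

    #include <stdint.h>

    /* Each node picks a fresh instance value when its DLM starts (boot
     * time is a convenient choice) and carries it in every message. */
    struct cccp_hdr {
        uint16_t node_id;
        uint32_t instance;   /* changes every time the node's DLM restarts */
        /* ... existing libcccp header fields ... */
    };

    /* On receipt: a known node_id arriving with an unexpected instance
     * means the node bounced while we were idle, so drive recovery and
     * remaster its locks.  (0 here means "never heard from".) */
    static int node_bounced(uint32_t last_seen, uint32_t now_seen)
    {
        return last_seen != 0 && last_seen != now_seen;
    }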
Linux Changes
Configuration File
At this point we basically assume we'll need a configuration file of
our very own. Currently HACMP has the configuration, and the bits
needed by the DLM are predigested and sent in. Looking at the
above cluster infrastructure discussion, we need:
- Node name
- Node integral id
- IP addresses for a node
We will have cllockd.u read and digest this file. It will be used
to interpret the data received from the cluster infrastructure and
pass on the appropriate information to cllockd.x and libcccp. Ian has
already downloaded open source code that can parse such a file for
us. We leave it to the admin to keep the file consistent on all nodes
in the cluster.
Cluster Infrastructure
I'll go look again at FailSafe. Ian will use his hardcoded config
file glue to start playing with libcccp. Ian will think a bit on
libcccp changes.
I'll lift existing code to talk to RSCT, that will allow us to run
on >2 node clusters, as well as clusters that aren't on a single
(broadcast-capable) subnet. I'll also look to integrate the Heartbeat
code into cllockd.u while I'm there.
Stuff Not Yet Covered Here
You may notice I've said nothing yet about some things:
- Locking in the kernel routines.
- Linux has it, no problem, map across the AIX calls to the
appropriate Linux calls.
- Logging.
- Current code calls bsdlog(), Linux has syslog(). Arguments are
the same, probably use macro definitions to call the "right" one.
- Tracing.
- Uses the AIX trace support in many places. Nothing equivalent.
For now, have to define out via macros. Use syslog and printk calls
where absolutely necessary for now.
- Memory management.
- AIX allows the DLM to allocate a private heap segment in
the kernel. This is virtual, so doesn't take up space in the kernel
segment, but can (of course) suck up page space if it gets too big.
There is no current requirement for the lock structures kept in this
heap to be pinned. If they're paged, performance tanks, but, if you
have that many locks, oh well!
- Linux doesn't have this support for private heap segments.
- Fortunately, the DLM works by abstracting the memory calls
into its own routines, thus "hiding" the actual memory location and
management from the bulk of the code. Good.
- Modify the memory manager to deal with Linux routines. All memory
will come from the kernel space, which will affect the number of locks
that we'll reasonably support.
- Socket usage in the kernel.
- Current AIX code has sucked some base OS code that deals with
socket reads and writes into the DLM and modified it to accept
calls from kernel space.
- This is a license issue; since these are BSD/OSF files, they cannot
be released.
- Besides, Linux provides proper abstractions for using socket
communications within the kernel. Use these.
- Dumping information
- "Hidden" API - dump_resources() - calls into g_dump_resources()
into k_dump_resources() finally into kk_dump_resources(). Uses the
dump file name passed in on the command line.
- Hidden because it isn't listed in any .h file, just in the
api_glue.c file.
- Get rid of this. Rather, this information should be
available through /proc; convert the routines to respond to /proc
calls. The user can simply copy from there. Does anybody parse the
output? We can maintain the output format, I guess.
- Could modify the routine to be a read of /proc into some file, but
the kernel code would no longer care about this file.
- Orphan interface in cluster/clm.h: clmlockw()
- There doesn't seem to be any code that actually implements this,
and it is not listed in the book or documentation.
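For the logging item above, a minimal sketch of the macro approach;
the "DLM:" prefix and the priority handling on the kernel side are
assumptions, and the arguments follow the note above that bsdlog() and
syslog() take the same arguments:

    /* One name, two implementations: the user-side pieces keep calling
     * bsdlog() and get syslog(3); the kernel-side pieces get printk().
     * The kernel branch below just forces KERN_ERR; a real version would
     * translate the LOG_* priority to the matching KERN_* level. */
    #ifdef __KERNEL__
    #include <linux/kernel.h>
    #define bsdlog(pri, fmt, args...)  printk(KERN_ERR "DLM: " fmt, ##args)
    #else
    #include <syslog.h>
    #define bsdlog(pri, fmt, args...)  syslog(pri, fmt, ##args)
    #endif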
Basically, we'll deal with these as we come to them.
Overall Work Plan
I plan to start immediately attacking the cllockd.u and cllockd.x
code that manages the "device" (/dev/TheOneTrueDLM/) control. Actual
name TBD. For now I'll interface with RSCT since the current cluster
we have is not on a single subnet.
Next is to attack the API code enough to allow it to use the device
interface and to allow us to have a client to see if we can actually
lock anything...
Ian will continue on libcccp, explore modifying it to do liveness
determination, and as that comes together look at the kernel-side API.
All of the above subject to change, of course.