Current Code in Progress: Revising the VFS-reiserfs interface to not use
leashes, using the Linux dcache code, and porting from 2.1.42 to 2.1.57.
After that we will fix NFS interaction, and then simplify the balancing code
(we hope to cut it in half through careful coding).
This implementation offers value to the average Linux user: it delivers higher performance than ext2fs, the current Linux file system, for both large and small files (that is, for files not near the 4k node size). It also saves space to an extent that is important for some applications and convenient for most.
Since ext2fs exceeds FFS, UFS, and NTFS in performance, the implementation also offers potential value to commercial OS vendors who want better than ext2fs performance without directory size issues, and who find attractive its new method of employing preserve lists to preserve meta-data without substantive performance loss.
The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity is a major disadvantage of the approach compared to the 6,000 lines of the ext2fs approach.
I started the analysis of this problem assuming that small files needed to be aggregated in some way, and that the question was which aggregation was optimal. The simplest solution is to aggregate all small files in a directory together into either a file or the directory itself. This can be effective as a first approximation, and the performance results of [C-FFS] appear more successful than the earlier implementation of the approach in [NTFS]. Note that the authors of C-FFS fail to point out that they are comparing their performance to FFS, which is quite unsuitable for small files, especially compared to ext2fs. One cannot tell from their results whether they are faster or slower than ext2fs.
I considered the following in my decision not to use the approach employed by NTFS and C-FFS. Any aggregation into a file or directory wastes part of the last block of the aggregation. What does one do if there are only a few small files in a directory: aggregate them into the parent of the directory? What if a directory holds only a few small files at first and many later: how does one decide at what level to aggregate them, and when to take them back from a parent directory and store them directly in the directory? As we analyzed these questions we realized that this problem is closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions.
In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms.
Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. Subsequent to my releasing on the net my initial results and a detailed description of my design, [C-FFS] embarked on a small-file file system research project that has many ideas in common with ours. They published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. It is difficult to compare the packing of the two approaches since they are rather vague in their definitions of their packing algorithms. That said, their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. It is unfortunate that they compared their results to FFS rather than ext2fs, as the FFS approach chooses both a synchronization policy and a fragment layout policy that is markedly slower for smaller files. Since ext2fs would also show marked advantages over FFS for small files, one wonders how much advantage they would have over ext2fs. Hopefully they will port to Linux and it can then be measured.
Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them. A central point of our approach is that the optimal granularity of these often differs, and that abstracting them into separate layers, in which the granularity of one layer does not unintentionally impact other layers, can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note, while reading the algorithms, the areas in which reiserfs needs to go farther in doing so.
Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many of the solutions currently available for this in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE]. By simpler I mean that I added nothing to the file system API to distinguish large and small objects, and I leave it to directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility.
Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it.
Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution.
Current file systems are fairly well designed for the purposes their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly, based on then-current usage statistics, that there was no substantial market need for disk drives larger than 10MB. While [C-FFS] points out that 80% of file accesses are to files below 10k, it is not reasonable to expect usage measurements taken on file systems for which small files are inappropriate to show that small files are frequently used; application programmers are smarter than that. Currently 80% of file accesses are to the first order of magnitude in file size for which it is sensible to store the object in the file system. I regret that one can only speculate whether, once file systems become effective for small files and database tasks, usage patterns will shift to 80% of file accesses being to files of less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance.
In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.
Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node.
In my tree, I still require that a filename be resolved one component at a time. Whether this is necessary or optimal is an interesting topic for future research. The issue is more complex than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]).
The tree has three node types: internal nodes, formatted nodes, and unformatted nodes.
The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.)
Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. Reiserfs starts at the root node, examines its contents, and determines from them which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
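As a rough illustration of this descent, here is a toy in-memory sketch in C; it is not the reiserfs source, and the flat integer key, the fanout, and the field names are assumptions made only for illustration.

/* Toy model of the descent described above: delim_key[i] duplicates the
 * first key of the subtree rooted at child[i + 1].  Illustrative only. */
struct tree_node {
    int               is_internal;      /* 0 for a formatted (leaf) node    */
    int               nr_children;
    unsigned long     delim_key[170];   /* delimiting keys between children */
    struct tree_node *child[171];       /* pointers to sub-trees            */
};

/* Start at the root, branch on the delimiting keys at each internal
 * node, and stop at the formatted node that should contain the key. */
static struct tree_node *descend_to_formatted_node(struct tree_node *root,
                                                   unsigned long key)
{
    struct tree_node *node = root;

    while (node->is_internal) {
        int i = 0;

        /* Choose the last child whose preceding delimiting key is <= key. */
        while (i + 1 < node->nr_children && key >= node->delim_key[i])
            i++;
        node = node->child[i];
    }
    return node;
}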
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree.
All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and 1 node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides is vital to high performance, and in the Drops section I will describe how the lengthening of the path length that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake.
Formatted nodes consist of items. Items have three types: direct items, indirect items, and directory items. All items contain a key which is unique to the item. This key is used to sort, and find, the item.
Direct items contain the tails of files, and tails are the last part of the file (the last file_size modulo 4k of a file).
Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes.
Directory items contain the key of the first directory entry in the item followed by a number of directory entries. The first item of a file or directory contains its stat data.
A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes. If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item.
Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining).
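As a rough illustration of the items just described, the following C sketch shows one plausible shape for an item header; the field names and widths are assumptions made for illustration, not the reiserfs on-disk format.

enum item_type {
    STAT_DATA_ITEM,    /* first item of a file or directory: its stat data      */
    INDIRECT_ITEM,     /* pointers to the unformatted nodes holding a file body */
    DIRECT_ITEM,       /* the tail: the last file_size modulo 4k bytes          */
    DIRECTORY_ITEM     /* a run of directory entries sharing one stored key     */
};

struct item_head {
    unsigned char  key[16];     /* the item's key (its structure is given below);
                                   items within a node are sorted by this key   */
    enum item_type type;
    unsigned short length;      /* bytes of item body stored in this node       */
    unsigned short body_offset; /* where the body lies within the node          */
};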
In reiserfs, formatted nodes have the property that no node and its two neighbors can be compressed into two nodes. Internal nodes have the property that each node is at least 50% full. This difference in compressibility reflects a different prioritization of space efficiency versus balancing overhead for the two levels. I feel greater compression is best left to an FS cleaner rather than being attempted dynamically.
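Stated as a predicate (a sketch with illustrative names, not the actual test in the balancing code), the formatted-node condition looks like this:

#define NODE_SIZE 4096   /* illustrative: the 4k node size used in this paper */

/* The formatted-node condition is violated if a node and its two
 * neighbors would fit, compressed, into two nodes. */
static int violates_formatted_node_condition(int used_left, int used_middle,
                                             int used_right)
{
    return used_left + used_middle + used_right <= 2 * NODE_SIZE;
}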
It is not necessary for a file's packing to reflect its name, that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call. It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different than that determined by object names is more appropriate.
Currently the mandatory tying of packing locality and semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider.
The Structure of a Key: Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused object_id when the object is created; the tendency of this to pack successive object creations in a directory adjacently is fortuitous for many usage patterns. For files, the offset is the offset within the logical object of the first byte of the item. In version 0.2 every directory entry carried its own key and was a distinct item; in the current version I store one key in the item, the key of the first entry, and compute each entry's key as needed from it. For directories, the offset key component is the first four bytes of the filename, which you may think of as a lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes. For all item types it indicates the item type, and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature.
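A sketch of this key in C, with component-by-component comparison, follows; the 32-bit field widths are an assumption made for illustration, not the on-disk layout.

#include <stdint.h>

struct key {
    uint32_t locality_id;  /* by default the object_id of the parent directory */
    uint32_t object_id;    /* unique id of the file or directory                */
    uint32_t offset;       /* byte offset for files; for directories the first
                              four bytes of the filename (lexicographic offset) */
    uint32_t uniqueness;   /* distinguishes entries equal in the first 4 bytes,
                              and encodes the item type                         */
};

/* Keys compare component by component; because locality_id is the most
 * significant component, all items of one directory sort together. */
static int key_cmp(const struct key *a, const struct key *b)
{
    if (a->locality_id != b->locality_id)
        return a->locality_id < b->locality_id ? -1 : 1;
    if (a->object_id != b->object_id)
        return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness)
        return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}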
Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories.
In summary, this approach 1) places files from the same directory together, and 2) places directory entries from the same directory together with each other and with the stat data for the directory. Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not pack the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.
Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename, for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree.
Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files.
Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size. When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node.
In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. The performance loss for introducing indirect items (BLOBs) was marked, and unmistakable.
Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging.
A moderately better solution would have been to have simply eliminated the requirement for placement of the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to have depended on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap.)
For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one maps keys to formatted nodes, and one maps keys to directory entries and stat_data. It is only natural to think that this means that reading a file, accessing first the directory entry and stat_data, then the unformatted nodes, then the tail, would require hopping long distances across the disk, going first to one tree and then to another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm:
Block numbers are assigned to nodes when the nodes are created, or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction. Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer.
The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of which node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code describes this best:
/*
 * For use by reiserfs_get_new_blocknrs when determining where in the
 * bitmap to start the search for a free block, and for use by the
 * read-ahead algorithm when there are not enough nodes to the right and
 * in the same packing locality for packing-locality read-ahead purposes.
 */
get_logical_layout_left_neighbors_blocknr(key of current node)
{
    /* Based on examination of the current node's key and type, find the
       virtual left neighbor of that node. */
    if (body node) {
        if (first body node of file) {
            if (a node in the tail tree with a lesser key but the same
                packing locality exists)
                return blocknr of such node with the largest key;
            else
                return blocknr of the node in the stat_data tree with the
                       largest key less than the key of the current node;
        } else
            return blocknr of the node in the body tree with the largest
                   key less than the key of the current node;
    } else if (tail node) {
        if (a node in the body tree belonging to the same file as the
            first tail of the current node exists)
            return its blocknr;
        else if (a node in the tail tree with a lesser delimiting key but
                 the same packing locality exists)
            return blocknr of such node with the largest delimiting key;
        else
            return blocknr of the node in the stat_data tree with the
                   largest key less than the key of the current node;
    } else {    /* stat_data tree node */
        if (a stat_data node with a lesser key from the same packing
            locality exists)
            return blocknr of such node with the largest key;
        else    /* no node from the same packing locality with a lesser key */
            return blocknr of the node with the largest delimiting key that
                   is less than the key of the current node;
    }
}

/* For use by packing-locality read-ahead. */
get_logical_layout_right_neighbors_blocknr(key of current node)
{
    the right-handed (mirror image) version of
    get_logical_layout_left_neighbors_blocknr;
}
It is my hope that this will improve caching of pointers to unformatted nodes, plus improving caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.
When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of items on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item specific item operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item rather than requiring modifying most parts of the balancing code as it does now.
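As a hypothetical sketch of what such an item-operations layer could look like in C (none of these names are an existing reiserfs interface; they only illustrate how the balancing code could dispatch through a per-type table):

struct item_head;   /* as sketched in the items section; the balancing code
                       treats the item body itself as opaque */

struct item_operations {
    int  (*bytes_needed)(const struct item_head *ih, int from, int count);
    void (*copy_units)(char *dest, const struct item_head *ih,
                       const char *body, int from, int count);
    void (*cut_units)(struct item_head *ih, char *body, int from, int count);
    int  (*mergeable)(const struct item_head *left,
                      const struct item_head *right);
};

/* One table per item type, indexed by the type stored in the item head;
 * adding a compressed item would mean adding one more table here, not
 * modifying the balancing code. */
extern const struct item_operations *item_ops[];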
We now feel that the function to determine what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.
Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries.
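As a minimal illustration of that difference (not taken from any benchmark in this paper), the first loop below hands the file system 16384 tiny unbuffered writes, while the second hands it a few large ones because stdio coalesces them:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    static const char word[4] = { 'a', 'b', 'c', 'd' };
    int i;

    /* Many small unbuffered writes: each write() extends the tail by 4
     * bytes, and the file system pays the shifting cost each time. */
    int fd = open("unbuffered.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (i = 0; i < 16384; i++)
        write(fd, word, sizeof word);
    close(fd);

    /* The same data through stdio: the C library coalesces the writes,
     * so the file system sees a few large ones instead. */
    FILE *f = fopen("buffered.out", "w");
    for (i = 0; i < 16384; i++)
        fwrite(word, 1, sizeof word, f);
    fclose(f);
    return 0;
}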
By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. Extent-based file systems such as VxFS, and write-clustering file systems such as ext2fs, are not so effective in applying these techniques that they choose 512-byte blocks rather than 1k blocks as their defaults: even with these techniques, smaller blocks remain costly enough that they are not chosen. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space.
There are a number of worthwhile large file optimizations that have
not been added to either ext2fs or reiserfs, and both file systems are
somewhat primitive in this regard, reiserfs being the more primitive of
the two. Large files simply were not my research focus, and it being a
small research project I did not implement the many well known techniques
for enhancing large file I/O. The buffering algorithms are probably more
crucial than any other component in large file I/O, and partly out of a
desire for a fair comparison of the approaches I have not modified these.
I have added no significant optimizations for large files, beyond increasing
the block size, that are not found in ext2fs. Except for the size of the
blocks, there is not a large inherent difference between: 1) the cost of
adding a pointer to an unformatted node to my tree plus writing the node,
and 2) adding an address field to an inode plus writing the block. It is
likely that except for block size the primary determinants of high performance
large file access are orthogonal to the decision of whether to use balanced
tree algorithms for small and medium sized files.
Many file systems (almost all of the older ones) add serialization of I/O that is not requested by the application programmer, and provide no mechanism for the application programmer to avoid such serialization when it is not desired. This results in applications stalling on I/O without the application programmer desiring that they do so. Since the serialization is typically performed at file boundaries, the performance impact of unrequested serialization becomes progressively more dramatic as one decreases in file size below the track size, and for small files it can completely dominate performance.
The motivation for the massive inefficiency of such an approach is that the original UFS and FAT file systems lacked the ability to avoid becoming corrupted when only part of an update to the file system reached disk. To minimize this problem, file systems such as [FFS], [UFS], and [NTFS] employed aggressive unrequested synchronization. The large investment by many in the file system developer community in skill development with older technologies that assume added serialization means that the newer approaches of mostly avoiding serialization will be slow in their adoption.
[Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that need not survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file.
[Ganger] discusses methods for using ordering of writes, rather than serialization, to ensure conventional file system meta-data integrity; [McVoy] previously suggested, but did not implement, ordering of buffer writes.
Ext2fs was a leader in avoiding unrequested serialization, and I have much personal experience with it that leads me to prefer compiles that are fast. Its breaking the POSIX standard is not in practice a problem: only a few applications have a need for adding an fsync() after an fclose() in their source code when they are ported to Linux. I might argue that it is the best actually implemented policy at this time of the various OSes for performing compiling, but I do not by any means claim it to be optimal. [ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste.:-) ]
[NTFS] by default adds unrequested serialization to an extent that even old fashioned file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies).
Reiserfs employs a new scheme called preserve lists for ensuring recoverability, with unrequested serialization occurring only when disk space is low, and recoverability in principle being better than that of [FFS]. (It most likely will not be better in reality until we have gone through extensive testing.) I have avoided changing Linux buffering for reiserfs 1.0, in part so as to make the benchmarks more fair, and in part out of a policy of doing as little additional to our core technology as possible before we ship a file system useful to Linux users.
Reiserfs, like ext2fs, allows the application programmer control over serialization, and for small files this is crucial to performance. Reiserfs and ext2fs default to unserialized I/O, and perform serialization on request (via fsync() ), but the critical issue is not what is the default, but whether there is a choice. I will go so far as to say that I currently err primarily on the side of not providing sufficient options to the programmer for specifying desired buffer delay policy.
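On the application side, requesting serialization only where durability is actually needed is simple; the following sketch shows ordinary POSIX usage, not a reiserfs-specific interface:

#include <stdio.h>
#include <unistd.h>

/* Write a record; the I/O is buffered and unserialized by default, and
 * the application asks for durability exactly once, where it needs it. */
static int save_record(const char *path, const char *data, size_t len)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    fwrite(data, 1, len, f);
    fflush(f);                 /* push the stdio buffer to the kernel        */
    fsync(fileno(f));          /* requested serialization: wait for the disk */
    fclose(f);
    return 0;
}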
We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that received the item was written. This is necessary to prevent a system crash from causing the loss of an item that was not even recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. This scheme is as follows:
If an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its block number on a ``preserve list''. (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the elevator's current direction.)
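A hedged sketch of this in C follows; the structures and the find_nearest_free_block() helper are illustrative stand-ins, not the reiserfs implementation.

#include <stdlib.h>

struct node_buffer {
    unsigned long blocknr;            /* block this buffer will be written to */
    /* ... node contents ... */
};

struct preserve_entry {
    unsigned long          blocknr;   /* old location, not yet reusable       */
    struct preserve_entry *next;
};

static struct preserve_entry *preserve_list;

/* Assumed helper: nearest free block found starting from "near",
 * searching in the elevator's current direction. */
extern unsigned long find_nearest_free_block(unsigned long near);

/* Called when an item has been shifted out of the node held in buf. */
static void relocate_and_preserve(struct node_buffer *buf)
{
    struct preserve_entry *e = malloc(sizeof(*e));

    if (e == NULL)
        return;                       /* error handling elided in this sketch */

    /* Remember the old block instead of freeing it; it becomes reusable
     * only after a moment of consistency, so a crash cannot lose an item
     * that was merely shifted between nodes. */
    e->blocknr = buf->blocknr;
    e->next = preserve_list;
    preserve_list = e;

    /* The buffer will now be written near the old block's left neighbor
     * rather than over its old location. */
    buf->blocknr = find_nearest_free_block(buf->blocknr - 1);
}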
In practice, despite its seeming inelegance, it is remarkably effective and efficient. Alternatives such as logging cost overhead consistently, whereas preserve list overhead occurs erratically, which contributes to the intuitive impression of inelegance. It is a bit odd how the log-structured file system style of consistently paying the overhead of writing twice, and of writing far from the semantic neighbor, seems intuitively more elegant than the preserve list practice of paying occasionally for a sync and slightly for writing near to the optimal placement, yes? When comparing the approach to a log, please remember that one can use an FS cleaner with this approach as well, even if the data does not move as far, and if the spikes caused by occasionally syncing the disk when it is full are distasteful one can always add a gentler throttle to the file system to ensure that moments of consistency occur. Further, it lays the groundwork for a future fastboot mechanism in which, prior to each pass of the elevator algorithm, I will write the preserve list and the list of buffers on the elevator list to disk. Log structured file systems do not eliminate the need for recovery; they confine it. By storing the preserve list and the elevator list on disk, I will be able to confine recovery without employing a log. Only implementation will prove the extent of the overhead this mechanism costs.
Note that ext2fs avoids the metadata shifting problem for directory entries by never shrinking directories.
My benchmarks had their origins in the Andrew Benchmark, but I was motivated to innovate by the following issues: 1) Unlike the file systems it was designed to test, both ext2fs and reiserfs avoid the use of unrequested serialization, and as a result the CPU, and not either of the file systems, is the bottleneck for compiles. Unfortunately I am reduced to copying files using cp if I want to generate a sufficiently heavy load using a 200MHz Pentium Pro. 2) Even the use of sed to search for all occurrences of the word kangaroo was user process CPU bound for reiserfs (sed is just barely fast enough to not bottleneck ext2). I found it necessary to write a simple program to read a file without processing it. 3) I find it much more scientifically interesting to use benchmarks to illustrate the impact of different design tradeoffs as I vary file size and directory size than to engage in the usual discussion of which approach is better overall via a single numeric measure. Such measures necessarily involve an arbitrary, controversial weighting of various aspects of performance with much referencing of tradition and authority figures, and so I defer such discussions of whether mini-vans can move more cargo than motorcycles to their appropriate marketing forums.
LADDIS is a network file system benchmark, and optimizing performance for NFS is a research topic of its own that I did not choose to pursue however important it might be.
In this benchmark, when creating a file I determine which order of magnitude its size will fall in, with one file in five being promoted into a higher size class than the base size, and one in five of those being promoted to the next size, etc. The size of an individual file is randomly distributed within the size range of its category. Most bytes belong to the larger sizes; most files belong to the smaller sizes. I assume 100% inode space utilization for ext2fs. The file sets are identical in all ways for ext2fs and reiserfs (the pseudo-randomization uses the same seed of 1 for both file systems; I avoided changing the seed to produce a more ``average'' randomization).
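A sketch of that size selection in C, assuming the 0-100 byte class as the base and a factor of ten per class as in the file set table below; this is an illustration of the rule, not the author's benchmark source.

#include <stdlib.h>

/* Pick a file size: promote to the next order of magnitude with
 * probability 1/5, repeatedly, then choose a size uniformly within the
 * chosen class.  srand(1) would reproduce the fixed seed mentioned above. */
static long pick_file_size(void)
{
    long limit = 100;                       /* base class: 0-100 bytes */

    while (limit < 10L * 1000 * 1000 && rand() % 5 == 0)
        limit *= 10;                        /* 0-1K, 0-10K, ... 0-10M  */

    return rand() % (limit + 1);
}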
Note that the literature has not yet produced a statistical distribution function for file sizes, though there are many differing statistics available on the net. The author does not think that this function provided here is correct for that purpose (it has too many large files), but it is perhaps adequate to show that the impact on total system performance of small files can be proportional to their number rather than their space consumption. I wish to note that the use of a file set containing files in the 0-100 byte range is in no way fair to ext2fs, which like most file systems was not specially designed for files below 1k, though its architecture is better than most file systems would be for such a file set due to its synchronization policy.
To distinguish the effect of directory size on the efficiency of directory operations, I produce these statistics for the same file set stored variously in trees with one directory, 10 sub-directories, 100 sub-sub-directories, and 1000 sub-sub-sub-directories. I use 16000 files with sizes in the table below. Note that at 16000 files in one directory, ext2 is exceeding its design intent, and it takes long enough to perform directory insertions that it is not truly usable; for this reason larger benchmarks were not performed. Ratios were calculated using measured numbers with more precision than those shown in the tables.
number of files | size of files | total size of files
5 | 0-10M | 33,712,005
21 | 0-1M | 9,335,273
113 | 0-100K | 6,193,394
615 | 0-10K | 3,055,154
3002 | 0-1K | 1,479,559
12244 | 0-100B | 610,149
total number of all files: 16000 | total size of all files: 54,385,534
ext2 | reiserfs | ratio: reiserfs/ext2 |
58% | 43% | 1.35 |
Phase of Benchmark (each cell: ext2fs time / reiserfs time = ratio) | files in one directory | files in ten subdirectories | files in 100 subsubdirectories | files in 1000 subsubsub-directories
creation of 80/20 Banded File Set | 4:31/10.48=26 | 42.35/13.42=3.16 | 25.95/15.45=1.68 | 54.84/40.86=1.34
cp -r testdir testdir2 | 7:35/37.22=12 | 1:53/39.13=2.9 | 1:35/46.76=2.0 | 2:16/1:04=2.13
scanning all files | 6:32/57.40=6.83 | 2:42/55.62=2.9 | 2:29/59.79=2.5 | 2:54/1:33=1.9
recursive stat of all files (find . -exec ls -l {}>&/dev/null \;) [a] | 9:05/3:35=2.63 | 5:11/3:35=1.44 | 4:14/3:38=1.17 | 5:05/4:01=1.27
copies semantically interleaved with originals (find linux -name '*' cp {} {}.o \;) [b] | 30:30/3:59=7.7 | 5:39/4:19=1.3 | 5:39/4:19=1.3 | 7:06/4:49=1.5
accessing stat data (du -a | wc -l) | 15:23/35.71=26 | 2:37/37.59=4.2 | 1:29/42.58=2.08 | 1:59/38.45=3.1
rm -rf * | 3:32/21.15=10 | 56.17/22.57=2.5 | 53.52/25.72=2.08 | 1:58/29.93=4.0
[a] the find command was the bottleneck for reiserfs (and in the one and ten directory cases it was not the bottleneck for ext2)
[b] the find command was significant overhead for reiserfs in all cases
Phase of Benchmark (each cell: ext2fs time / reiserfs time = ratio) | files in one directory | files in ten subdirectories | files in 100 subsubdirectories | files in 1000 subsubsub-directories
creation of File Set | 4:15/4.51=57 | 26.16/04.44=5.9 | 11.10/5.24=2.11 | 25.03/16.40=1.53
cp -r testdir testdir2 | 7:22/17.58=25 | 1:30/16.74=5.4 | 1:08/16.74=4.0 | 1:52/19.32=5.8
scanning all files | 6:15/36.35=10.4 | 2:26/35.69=4.1 | 5:33/1:11=4.7 | 4.1
recursive stat of all files (find . -exec ls -l {}>&/dev/null \;) [a] | 9:02/3:32=2.6 | 5:09/3:33=1.45 | 4:04/3:32=1.15 | 4:17/3:47=1.14
copy semantically interleaved with originals (find linux -name '*' cp {} {}.o \;) [b] | 30:01/3:04=10 | 6:59/3:10=2.21 | 5:05/3:12=1.6 | 6:33/3:14=2.02
accessing stat data (du -a | wc -l) | 15:25/6.38=145 | 2:29/6.33=24 | 1:14/6.31=12 | 1:07/6.79=10
rm -rf * | 3:05/19.42=9.5 | 27.45/21.72=1.26 | 16.29/24.38=0.67 | 40.97/27.97=1.46
[a] the find command was the bottleneck for reiserfs (and in the one and ten directory cases it was not the bottleneck for ext2)
[b] the find command was significant overhead for reiserfs in all cases
ext2 | reiserfs | ratio: reiserfs/ext2 |
14% | 3% | 5.0 |
Phase of Benchmark | Time for ext2fs | Time for reiserfs | ext2fs/reiserfs |
three simultaneous copies of file set (same source, three different semantic locations in file system) | 2:40 | 2:16 | 1.18 |
three simultaneous semantically interleaved copies (different sources: find $i -exec cp {} {}.o \;) | 3:35 | 2:51 | 1.25
three simultaneous recursive stats of all files (find $i -exec ls -l {}) | 2:28 | 3:03 | 0.82
three simultaneous reads of three trees | 1:12 | 1:21 | 0.89
three simultaneous deletions (rm -rf $i) | 0:34.31 | 0:25 | 1.37
Phase of Benchmark | Time for ext2fs | Time for reiserfs | ext2fs/reiserfs |
cp -r of linux kernel source code | 38.2 | 31.85 | 1.13 |
scanning all files (find . -exec wc {} >&/dev/null \;) | 2:51 | 2:40 | 1.07 |
recursive stat of all files (du -s * >&/dev/null) | 2:58 | 3:09 | 0.94 |
copy all .c files to .o files for seven linux kernels, copies semantically interleaved with originals (find linux -name '*.c' cp {} {}.o \;) | 3:25.27 | 2:59 | 1.23
find and then semantically interleaved delete files (find -type f -name '*[aos]*' -exec rm {} \;) | 3:56.82 | 3:07 | 1.56
rm -rf * | 37.79 | 5.38 | 10.1 |
Phase of Benchmark | Time for ext2fs | Time for reiserfs | ext2fs/reiserfs |
cp -r of file set | 0:45.79 | 0:36.39 | 1.26 |
semantically interleaved copying of 163MB (find usr cp {} {}.o \;) | 2:05.31 | 1:16.56 | 1.64
recursive stat of all files (find . -exec ls -l \;) | 0:03.44 | 0:03.02 | 1.14 |
scanning all files before interleaved copy is performed (find . -exec wc {} >&/dev/null \;) | 0:31.44 | 0:27.68 | 1.14 |
scanning all files after interleaved copy is performed (find . -exec wc {} >&/dev/null \;) | 1:00.55 | 0:59.22 | 1.02 |
find and then semantically interleaved delete files (find -type f -name '*[aos]*' -exec rm {} \;) | 0:25.88 | 0:02.38 | 11
rm -rf * | 0:02.63 | 0:00.16 | 16 |
Phase of Benchmark | Time for ext2fs | Time for reiserfs | ext2fs/reiserfs |
three simultaneous copies of the same 97 MB file set | 44.14 | 36.52 | 1.21 |
three simultaneous semantically interleaved copies of 128MB (different sources) (find usr cp {} {}.o \;) | 64.89 | 54.96 | 1.18
recursive stat of all files (find . -exec ls -l \;) | 1.72 | 1.44 | 1.19 |
scanning all files (find . -exec wc {} >&/dev/null \;) | 15.78 | 14.36 | 1.10 |
three find and then semantically interleaved deletions of files (find $i -type f -name '*[aos]*' -exec rm {} \;) | 6.5 | 0.81 | 8.0
three simultaneous tree deletes (rm -rf $i) | 7.78 | 0.92 | 8.5 |
In the Preserve Lists section I speculate
on the possibilities for a fastboot implementation that would merge the
better features of preserve lists and logging.
I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell. At some point an OS vendor will realize that if it:
Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and the acknowledgements section of this paper.
If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.
G.M. Adel'son-Vel'skii and E.M. Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
[Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames, it was an interesting file system architecture for its time in a number of ways, now its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.
[Bach] Maurice J. Bach, ``The Design of the Unix Operating System'', 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ, superbly written but sadly dated, contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia].
[BLOB] R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference (body of paper not on web) 1982: 207-212, See Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.
[Chen] Chen, P.M. Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, also available on Chen's web page.
[C-FFS] Ganger, Gregory R., Kaashoek, M. Frans (postscript paper available via their web page). A very well written paper focused on small file issues; they use markedly similar notions (most especially their concept of grouping compared to my packing localities), and they do so without bothering to credit our work, despite our having posted our initial results and the underlying concepts on the web immediately prior to what would seem to be the start of their work according to their web page. To receive such flattery from MIT is most unexpected.
[ext2fs] by Remi Card; extensive information and source code are available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.
[FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181--197, August 1984 describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments places it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists, forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories)
[Ganger] Gregory R. Ganger, Yale N. Patt, ``Metadata Update Performance in File Systems'' (abstract only).
[Gifford] (postscript only). Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.
[Holton and Das] Holton, Mike, Das, Raj. ``The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file).'' Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
[Howard] ``Scale and Performance in a Distributed File System'', Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satayanarayanan, N., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988 A classic benchmark, it was too CPU bound for both ext2fs and reiserfs.
[Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973, the earliest reference discussing trees storing records of varying length.
[LADDIS] Wittle, Mark., and Bruce, Keith., ``LADDIS: The Next Generation in NFS File Server Benchmarking'', Proceedings of the Summer 1993 USENIX Conference.'', July 1993, pp. 111-128
[Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry ``Data Structures & Their Algorithms'', HarperCollins Publishers, NY, NY, 1991, an algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
[McCreight] McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977, describes algorithms for trees with variable length records.
[McVoy and Kleiman], the implementation of write-clustering for Sun's UFS.
[OLE] ``Inside OLE'' by Kraig Brockschmidt, discusses Structured Storage (abstract only: http://www.microsoft.com/mspress/books/abs/5-843-2b.htm).
[Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. A trace-driven analysis of the UNIX 4.2BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15--24, Orcas Island, WA, December 1985.
[NTFS] ``Inside the Windows NT File System'' the book is written by Helen Custer, NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994, an easy to read little book, they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. A less serialized higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space.
[Peacock] K. Peacock, ``The CounterPoint Fast File System'', Proceedings of the Usenix Conference Winter 1988
[Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland Oregon, 1985. It is a pity he no longer has this paper on his home page; it was short, informal, and drove home why inconsistent naming schemes in an OS are detrimental. A discussion of naming in plan9 is available on the web.
[Rosenblum and Ousterhout] ``The Design and Implementation of a Log-Structured File System'', Mendel Rosenblum and John K. Ousterhout, February 1992 ACM Transactions on Computer Systems, this paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.sunlabs.com/~ouster/seltzer93.html and the arguments by Margo Seltzer it links to.
[Snyder] ``tmpfs: A Virtual Memory File System'' discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.