Variable Length Records

published: Wed, 7-Apr-2004   |   updated: Tue, 20-Apr-2004
Birdhouse

I had a question from a reader recently:

Where should I be looking to convert a simple B-Tree/Record manager system into one that can handle varying length records?

This is an interesting question, with many possible answers. Writing a record manager for fixed length records is fairly easy, but writing one for variable-length records?

I replied with a small set of three possible algorithms:

1. Use a text file approach. By this I mean that the records appear one after the other, in sequence, no gaps, with either a record termination character to delimit records, or a length word at the start to define the length of the record (a bit like the Pascal shortstring).

The problem with 1a, if I may call it that, is that you have to make sure that your record doesn't contain the record separator character. For text files, this is easy enough. These days, you'd probably like to have binary data in some form so you don't have to convert back and forth, so this scheme is not often used (apart from text files, of course).

For reading, it's fairly easy, if longwinded, to write a routine that calculates the offset of a particular record. Essentially, you assume that the file will be read sequentially from beginning to end. As you read the file, you build up an array of record offsets. The array is fixed in size, say 100 items, so eventually you'll fill it up. Discard every other item and pack the rest. You'll then have an array of offsets to every *other* record for the first 100 records, with room to add 50 more offsets. Eventually you'll fill that up too once you've read 200 records. So drop every other item, pack the rest and the array becomes an array of offsets to every fourth record. Repeat ad infinitum, or at least until you reach the end of the file.

Once the index has been built, find the nearest offset in your array to your record, seek there, and then read records until you find the one you want. The problem though is this: there will be times when you ask for the 10,000th record or so without having built the index first. Nothing for it but to read the first 10,000 records, building up the offset array as you go. I implemented this algorithm in a text file stream for SysTools. You can get the source from sourceforge.net. An small improvement on this is to persist the index into its own file, but then you'll be faced with the problems of regenerating the index every time you write to the file of records.

The biggest problem with algorithm 1 as a whole is that insertion and deletion of records is a real pain in the neck. In essence the only way to do it is to read the entire file into memory into an array, do the insertions and deletions you want, and then write the entire thing back out again. For a text file, this is exactly what a text editor like Notepad or <your IDE of choice> does: it reads it all in, you edit to your heart's content, and then it all gets written out again, overwriting the old file that's on disk.

2. Simple FAT type system. Here you have some kind of header in your file to hold information about the rest of the file. The rest of the file is divided up into fixed size chunks, say 512 bytes. Records would occupy a whole number of the 512-byte chunks. The header would be a bitmap (as in an array of bits, not a pretty picture) that showed which chunks were used and which weren't. So if the file could contain a maximum of 1Gb of data, there would be about 2 million chunks, and the bitmap would be a quarter of a meg in size.

For this you'd need an external index that ordered the record in some fashion and that had pointers to each record's first chunk.

Problems abound, making the algorithm interesting to implement. Although you know where records start from your external index, how you know where the other chunks are that make up the record? The easy way is to set aside a long value (4 bytes) in the chunk that contains the number of the next chunk. If it's -1 there is no next chuck.

Another: All records would have to have a record length stored somewhere (probably in the first chunk), otherwise it becomes impossible to know where the record ends in its last chunk. So, in the first chunk you'll be able to store at most 504 bytes, in subsequent chunks at most 508 bytes.

Another: You have to read the entire bitmap in if you wanted to do any updates to the records, and write it out again on closing. If the bitmap becomes corrupt, kiss you data goodbye, sine it will be nigh on impossible to reconstruct it.

Another: you have to define the maximum size the file is going to get when you create it. I gave the example of 1GB of data requiring a bitmap of 256KB: you have to reserve space for this right at the beginning. So for a brand new file with no records, its size would be 256KB from the start. Of course you can play around with chunk size and so on (depending on the average size of your records) to alleviate the initial hit, but you will always have that hit.

Another: You can't make your chunk sizes too large in order to reduce the initial bitmap size hit because, on average, your records will waste one half of the chunk size per record. So, for a chunk size of 512 bytes, on average you'll waste 256 bytes per record.

(Note that there are other variants of the FAT type algorithm, including the one used by the DOS FAT system, where the chains of chunks are built in the FAT. Each FAT entry is in fact a pointer to the next chunk, or is -1 for end of chain, or is -2 for unused. Or something like that.)

3. Heap system. This is the most intricate. Essentially it mimics the heap memory manager that's built into the run-time library. It makes use of a free list of space available on the file.

Here's a brief sketch. Decide on a granularity size, the larger it is the less random access reads you'll have to do, the smaller it is the less space you'll have to waste. Let's say it's 128 bytes in size.

Create a file with a single 128 byte chunk, defining the initial free list. A free list is merely a linked list where each node contains a pointer (an offset in the file) to the next block of free space (if there is none, it'll be -1) and the size of this block of free space. For the initially created file, the first 128 byte chunk merely contains the free list head node, a node that points to the start of the free list. For the initial file there is no free list, so the pointer is -1.

Say you need to write a record. Proceed like this: read the head node, find the first free block. If there is no free block, add the record to the end of the file. If there is a free block, is it big enough for this record (round the record size up to the granularity value)? No? Follow the free list node next pointer to the next free block. Eventually you'll find a block large enough, or you'll reach the end of the free list and just have to write the record at the end of the file. This process will take one or more reads until you find a place to write the record.

Once you find a block big enough, there's work to do before you write the record. You need to patch up the free list for a start. Say the free block was exactly the right size: you'll have to patch up the previous free list node to point to the next free list node (one write). If it's not the right size, you'll have a small free block left over after the record is written. You'll have to construct a free list node there to point to the next free list node (one write), and then patch up the previous free list node to point to this new one (one write). Oh and don't forget to keep the block sizes in the nodes up to date.

Of course writing a record at the end of the file means you don't have to update the free list.

Now for deletion. Easy-peasy in one way, you just make a new free list node where the record was, point it to the free list node from the free list head node, and make the free list head node point to this new space. (one read and two writes).

One issue (which you can ignore if you want), is what to do about two or more free blocks that happen to be next to each other in the file (but not necessarily next to each other in the free list). It would be nice to merge them together. A difficult exercise this one.

Another issue (also can be ignored). This algorithm uses the "find first fit" method. It stops looking for free space once it finds a block big enough. This may cause a split of the free block into a block big enough for the record and a small probably unusable block for the remainder of the free space. Another method is "find best fit" where you look through the entire free list until (1) you find a block that's exactly the right size, or (2) you've read through the entire list and are able to choose the best sized block.

A small issue is that you're going to be writing data, maintaining linked lists, all in one atomic operation. If the program crashes whilst updating the free list, you'll have a difficult job ahead of you, reconstructing the data. At least, the records are contiguous so it makes the job a little easier.

You could create the free list inside the 128-byte chunks, making it easier to update several links at once, as well as giving you the opportunity to find occasions to merge free blocks, but then we start veering towards a DOS FAT system again.

Another variant on this theme is to make the file a set of fixed- length records, sized according to the granularity value. When you want to write a record you go to the free list of fixed-sized records, get the number of records required to contain your record and split your record up into fixed-sized chunks. (You would sometimes just add records to the end of the file, obviously). Your variable length record sliced and diced amongst a set of fixed-sized records, with the caveat that you'll have to maintain a list of "next chunk" offsets as part of these records. When you delete a record using this variant you would add all of the released fixed-sized chunks to the free list. If the system crashes during an update, it's very hard to reassemble the individual records as they've been split amongst several fixed-sized records. This was the scheme used by TurboPower's old B-Tree Filer.

I hope that gives you some thoughts on how to implement a variable- length record manager. All implementations are left as interesting exercises for the reader, but if you want one I'm sure we could come to some arrangement...