Files are a wonderful abstraction: a stream of bytes that resides under a name, sorted into a hierarchy. Simple enough that a child can use it, powerful enough to be the motto of an entire family of operating systems. "Everything is a file" is one of the defining features of Unix, but it is also an abstraction, and as such, it is subject to the Law of Leaky Abstractions.
When building a storage engine, we need to have a pretty good idea of how to manage files. As it turns out, there are a lot of things that are just wrong about how we think about files. The "All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications" paper tested ten applications (from SQLite to Git to PostgreSQL) to find out whether they were properly writing to files. This paper is usually referred to as the ALICE (Application-Level Intelligent Crash Explorer) paper, after the tool created to explore failures in file system usage.
There are a lot of details that you need to take into account. For example, you may assume that changing a file and then calling fsync() will ensure that the changes to the file are made durable. That is correct, as long as you haven’t changed the file size. While the file data has been flushed, the file metadata was not. Which may mean some fun times in the future with the debugger.
If you want to build a reliable storage engine, you are going to need to develop some paranoid tendencies. The fact that the documentation says something is possible or not doesn’t translate to how things really are. When you care about reliability, the file system abstraction leaks badly, and it is hard to find a way around that.
You can simulate some errors, but the sheer variety and scope involved make things hard. Especially because the error that hits you may come not from the code that you are using but from several layers down. If you test errors from the file system, but the file system didn’t check for errors from the block device, you may end up in a funny state. In 99.999% of the cases it won’t matter, but you are going to hit that one-in-a-billion chance of an actual error at just the wrong time and see a broken system.
There have been a number of studies on the topic, and I think that Dan Luu’s summary of them lays a lot of the issues on the table. There is not a single storage hardware solution that doesn’t have failure conditions that can cause data corruption. And file systems don’t always handle them properly.
To a large degree, we make certain assumptions about the system, then we verify them constantly. The less we demand from the bottom layers, the more reliable we can make our system.
LWN has some good articles on the topic of making sure that the data actually reaches the disk and the complexities involved. The situation is made more complex by the fact that this depends on which OS and file system you use, and even what mode you used to mount a particular drive. As the author of a storage engine, you have to deal with these details in one of two ways:
- Explicitly specify the supported configurations, and raise hell if the user attempts to run on an unsupported one.
- Make it work across the board. Plan for failure, and when it happens, have the facilities in place to recover from it.
gavran/pal.h - The file system interface that we’ll consume for our storage engine
link:../include/gavran/pal.h[role=include]
Because working with files is such a huge, complex mess, and because it is different across operating systems, we’ll hide this complexity behind a platform abstraction layer (PAL). gavran/pal.h - The file system interface that we’ll consume for our storage engine shows the core functions that the PAL exposes.
The interface shown in gavran/pal.h - The file system interface that we’ll consume for our storage engine is very small and quite big at the same time. It is small, considering that it is the entire interface between a storage engine and a file system. At the same time, there is quite a bit going on here.
There are two important structures that we define here: span_t and file_handle_t. The span_t represents a range of memory, and it is named in honor of Span&lt;T&gt; from .NET. It is going to be used for memory mapping and in general to represent arbitrarily sized chunks of memory. The file_handle_t is going to be used to represent a file handle.
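To make this concrete, here is a rough sketch of what these two structures might look like. The field names and layout here are my assumptions for illustration; the real definitions live in gavran/pal.h.

[source,c]
----
#include <stddef.h>

// Sketch only: field names are assumptions, not the actual Gavran definitions
typedef struct span {
  void *address; // start of the memory range
  size_t size;   // length of the range, in bytes
} span_t;

typedef struct file_handle {
  int fd;         // the POSIX file descriptor (an int on Linux)
  char *filename; // the resolved, absolute path; owned by the handle
} file_handle_t;
----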
Note: Cross platform considerations
On POSIX, a file descriptor is an int. [...] And yes, I’m aware of "Famous last words".
The goal of this API is to allow us to do the right thing, regardless of what platform we are running on. Later on, we’ll build the storage engine implementation on top of these functions. Let’s look at test.c - Using the file API to create a file with a minimum size to see how we can make use of this API to work with files.
test.c - Using the file API to create a file with a minimum size
link:./code/test.c[role=include]
The code in test.c - Using the file API to create a file with a minimum size should ensure that, at the end of the day, we have a file with a minimum size of 8KB which will retain its size even in the case of an error or a system crash. That sounds easy enough to do in theory, but requires some dancing around to get there. Right now I’m going to focus on Linux as the implementation platform, but we’ll get to other systems down the line.
Now that we know how to use the API, let’s see how it is implemented. pal.linux.c - Opening (or creating) a file shows how we open or create a file in Gavran. It isn’t a long function, but you can see how all the different parts we have built come together to make it (much) easier to write correct code.
pal.linux.c - Opening (or creating) a file
link:./code/pal.linux.c[role=include]
- Allocate the memory for the handle and ensure that it will be freed if the function fails.
- Create a mutable copy of the file name, then ensure that the file exists (which requires a mutable string).
- Resolve a potentially relative path to an absolute one.
- Open the actual file. For now, you can ignore the meaning of the flags; we’ll discuss them later.
- Call fsync() on the parent directory if this is a new file; we’ll see why this is needed shortly.
pal.linux.c - Opening (or creating) a file starts by allocating a temporary string and then calling pal_ensure_path, which will ensure that the path to the file is valid and create it as an empty file.
It then proceeds to allocate the memory for the handle and set up try_defer() to ensure that in the case of failure, we’ll free the allocated memory automatically.
We then store a copy of the path in the handle’s filename field. Note that we are doing that through realpath(), which will resolve relative paths, symbolic links, etc. We need to free() the memory it returns. Because realpath() requires a file to exist before it can resolve it, we create an empty file in pal_ensure_path() and use the size of the file to know whether we are dealing with an existing database or a new one.
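Condensing that description into code, a sketch of the flow might look like the following. This is an illustration under assumptions: it reuses the hypothetical file_handle_t sketch from earlier, the real function uses Gavran’s ensure() / try_defer() infrastructure, and pal_ensure_path() is elided to a comment.

[source,c]
----
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int open_file_sketch(const char *path, file_handle_t *handle) {
  char *mutable_path = strdup(path); // pal_ensure_path() needs a mutable copy
  if (!mutable_path) return -1;
  // ... pal_ensure_path(mutable_path) would create the parent directories
  //     and an empty file here ...
  char *resolved = realpath(mutable_path, NULL); // the file must already exist
  free(mutable_path);
  if (!resolved) return -1;
  int fd = open(resolved, O_RDWR);
  if (fd == -1) {
    free(resolved);
    return -1;
  }
  handle->fd = fd;
  handle->filename = resolved; // the handle owns the resolved path now
  // ... for a brand new file, we would also fsync() the parent directory ...
  return 0;
}
----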
Beyond what was already mentioned, you need to take into account users who pass invalid values (a file name containing /, for example), all the intricacies of soft and hard links, size quotas, etc. The LWN post about this will probably turn your hair gray.
To keep the code size small and avoid overburdening ourselves with validation code, I’m going to state that I’m trusting the callers of the API to have already validated the data. As you can see in pal.linux.c - Opening (or creating) a file, we are only trying to prevent accidents, not trying to protect against malicious input. Gavran assumes that it is called with non-malicious values (note that this is different from valid values), as it is an embedded library and is meant to be used as part of your system.
There is one very important validation / setup step that we run before we open the file. We call pal_ensure_path(), which ensures that we can create the file. It will create the full directory path if needed, make sure that there isn’t a directory with that name in place, and in general tidy up the place before we get started. You can see how it is implemented in pal.linux.c - Ensuring that the provided path is valid to create a file on.
pal.linux.c - Ensuring that the provided path is valid to create a file on
link:./code/pal.linux.c[role=include]
Another responsibility that pal.linux.c - Opening (or creating) a file has, aside from creating the full directory path, is to let us know whether we are creating a new file or opening an existing one. Why do we care? After all, we are opening the file using O_CREAT, so the operating system will take care of that detail for us. The answer is that a newly created file requires us to fsync() its parent directory, as we’ll see next.
Warning: Beware the paths
On Linux, once you have opened a file, you no longer have access to its name. The file may have multiple names (hard links) or none (anonymous, or it has been deleted). That is why I’m keeping track of the provided name (after resolving to the real path) as part of the handle. It is too good a source of information to just discard.
The act of creating a file is a non-trivial operation, since we need to make sure that the file creation is atomic and durable.
pal.linux.c - Calling fsync() on a parent directory to ensure that the file metadata has been preserved deals with a fairly nasty problem on Linux: the management of file metadata. In particular, adding a file or changing the size of a file will cause changes not to the file itself, but to its parent directory. This isn’t actually true, but it is a fairly good lie, in the sense that it gives you enough information to have a mostly correct gut feeling about how things work.
If you want to understand how this works in more depth, I would recommend an excellent book on the topic: Practical File System Design with the Be File System. It is an old one (I read it for the first time close to two decades ago), but it does an excellent job of covering a lot of the details. At under 250 pages, it makes for an excellent read on a very complex topic. Another resource you might want to consider is GotenksFS, a file system explicitly designed for learning purposes. There is also libfuse, which allows you to define file systems in user space and can be a really interesting peek into how file systems really work.
Note: What about fork support?
When opening the file, we use [...]
Metadata updates not being part of fsync() makes it possible for you to make changes to a file (creating the file or increasing its size), call fsync() on the file, and then lose data because the existence or the size of the file wasn’t properly persisted to stable media. I’ll refer you to LWN again for the gory details.
pal.linux.c - Opening (or creating) a file and pal.linux.c - Calling fsync() on a parent directory to ensure that the file metadata has been preserved have quite a lot of error handling, but most of it isn’t visible, hidden in the defer() calls which simplify the overall system logic.
Why am I being so paranoid about error handling? To the point that I defined a whole infrastructure for managing it before writing a single byte to a file? The answer is simple: we aim to create an ACID storage engine, one which will take data and keep it. As such, we have to be aware that the underlying system can fail in interesting ways.
pal.linux.c - Calling fsync() on a parent directory to ensure that the file metadata has been preserved
link:./code/pal.linux.c[role=include]
- If there is no / in the path, assume that this is in the current directory.
- We need to find the name of the parent directory; we add a null in the right location to find it.
The ALICE paper found numerous issues in projects that have been heavily battle tested. And a few years ago there was a case of data loss in PostgreSQL that was tracked down to not checking the return value of an fsync() call. This LWN article summarizes the incident quite well.
If we aim to build a robust system, we must assume that anything can fail, and react accordingly.
Warning: What happens if close() fails from the defer()?
[...]
What pal.linux.c - Calling fsync() on a parent directory to ensure that the file metadata has been preserved does, essentially, is open() the parent directory using O_RDONLY and then call fsync() on the returned file descriptor. This instructs the file system to properly persist the directory information and protects us from losing a new file. Note that we rely on the fact that strings are mutable in C to truncate the file value by adding a null terminator for the parent directory (we restore it immediately afterward).
This trick allows us to avoid allocating memory during these operations. It is safe to make this change in memory because the string we mutate is the one belonging to the handle. That is our own memory, and we are fine to modify it.
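Here is a minimal sketch of that trick, assuming a plain int file descriptor and eliding the real error infrastructure (as well as edge cases such as a file sitting directly under the root):

[source,c]
----
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int fsync_parent_directory_sketch(char *file) {
  char *last_sep = strrchr(file, '/');
  int fd;
  if (!last_sep) {
    fd = open(".", O_RDONLY); // no '/', assume the current directory
  } else {
    *last_sep = 0; // truncate in place: file now reads as the parent dir
    fd = open(file, O_RDONLY);
    *last_sep = '/'; // restore the separator immediately
  }
  if (fd == -1) return -1;
  int rc = fsync(fd);
  if (close(fd) == -1) rc = -1;
  return rc;
}
----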
The reason we need to call fsync_parent_directory() is that, to make the life of the user easier, we are going to create the file if it does not exist, including any parent directories. And if we are creating these directories, we need to ensure that they won’t go away because of file system metadata issues.
Sadly, safely creating the full path is a somewhat tedious task. You can see how we approach it in pal.linux.c - Ensuring that the full path provided exists and the caller has access to it, where quite a lot is going on.
If the file already exists, we can return successfully immediately. If the file is a directory, we return an error. We then scan the path one directory at a time and check whether the directory exists. I’m using a trick here by putting a 0 in the place of the current directory separator.
pal.linux.c - Ensuring that the full path provided exists and the caller has access to it
link:./code/pal.linux.c[role=include]
In other words, given a /db/phones\0, we’ll start the search for / from the second character and then place a null terminator at that position. The filename would then be: /db\0phones\0. If needed, we will create the directory and then move to the next one. As part of that, we’ll restore the / and continue from the next /. This code requires a mutable string to work, but it does the job with no allocations, which is nice.
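A minimal sketch of that walk is shown below; the function name is hypothetical, and the fsync of each newly created directory (as well as the checks on an already existing file) is elided.

[source,c]
----
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

// Walk the path one component at a time, creating directories as needed.
// Requires a mutable string; works with zero allocations.
static int ensure_path_sketch(char *path) {
  for (char *p = strchr(path + 1, '/'); p; p = strchr(p + 1, '/')) {
    *p = 0; // path now reads as the prefix, e.g. "/db\0phones"
    if (mkdir(path, S_IRWXU) == -1 && errno != EEXIST) {
      *p = '/';
      return -1;
    }
    *p = '/'; // restore the separator, continue with the next component
  }
  return 0;
}
----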
Automatically creating a directory when given a path is a small feature. Call mkdir() before open(), pretty much. Doing it reliably is more complex, but not too hard. There are things you need to consider when doing this: what permissions you should give, how to handle any soft or hard links that you find along the way, etc.
On the other hand, not creating the directory automatically adds a tiny bump of frustration for the user. There is an extra, usually unnecessary, step to take along the way.
If we need to create a new directory, we make sure to call fsync_parent_directory()
to ensure that a power failure will not cause the directory to go poof.
As careful as I am being here, note that there are many scenarios that I’m not trying to cover. Using soft and hard links or junction points is the first example that pops to mind. And the work doubles if you need to deal with files or paths that come from an untrusted source. OWASP has quite a bit to say about the kinds of vulnerabilities that this might expose.
Earlier I discussed wanting to get the proper primitives and get as far away as possible from the level of code that you would usually need to write to deal with the file system. I think that now it is much clearer exactly why I want to get to that level as soon as I can.
When a file is created, it has zero bytes. That makes perfect sense; after all, there is no data there. When you write to the file, the file system will allocate the additional space needed on the fly. This is simple, requires no thinking on our part, and is exactly the wrong thing to want in a storage engine.
We just saw how hard we have to work to properly ensure that changes to the metadata (such as, for example, changing a file’s size) are properly protected against possible power failures. If we needed to call fsync_parent_directory() after every write, we could kiss our hopes for good performance goodbye. Instead of letting the file system allocate the disk space for our file on the fly, we’ll ask it for the space in advance, in well known locations. That will ensure that we only rarely need to call fsync_parent_directory().
Requesting the disk space in advance has another major benefit: it gives the file system the most information about how much disk space we want. It means that we give the file system the chance to give us long sequences of consecutive disk space. In the age of SSD and NVMe it isn’t as critical as it used to be, but it still matters. Depending on your age, you may recall running defrag to gain a substantial performance increase on your system, or you may never have heard of it at all.
pal.linux.c - Pre-allocate disk space by letting the file system know ahead of time what we need shows how we request that the file system allocate enough disk space for us. At its core, we simply call ftruncate(), which will extend the file for us, if needed. The function allows you to specify minimum and maximum file sizes because we’ll need to support truncation of files in the future, and since the amount of code is small enough, I would rather show you the whole thing at once than tease it out.
pal.linux.c - Pre-allocate disk space by letting the file system know ahead of time what we need
link:./code/pal.linux.c[role=include]
Note that in pal.linux.c - Pre-allocate disk space by letting the file system know ahead of time what we need we call fsync_parent_directory(). That isn’t interesting on its own; it is interesting because we need to pass it a mutable string.
Of course, fsync_parent_directory() makes sure to return things to normal by the time it returns, but it means that we cannot pass it a constant value and expect things to work. And handle→filename is going to be part of our database’s shared state, so we can’t mutate it casually. There might be other threads peeking at the value while it is being mutated. Therefore, we have to use a temporary copy. Convenience functions such as mem_duplicate_string() and using defer() make this a breeze.
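A condensed sketch of that flow, reusing the hypothetical sketches from earlier; strdup() stands in for mem_duplicate_string(), and the explicit free() for what defer() would normally take care of.

[source,c]
----
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int allocate_file_space_sketch(file_handle_t *handle, off_t min_size) {
  if (ftruncate(handle->fd, min_size) == -1) // extend the file if needed
    return -1;
  // handle->filename is shared state, so fsync the parent via a private copy
  char *copy = strdup(handle->filename);
  if (!copy) return -1;
  int rc = fsync_parent_directory_sketch(copy);
  free(copy); // defer() would normally handle this cleanup
  return rc;
}
----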
Warning: The perils of paths
I’m doing dynamic memory allocation when working with these paths. One of the advantages of stack allocation is that it is automatically cleaned up. We get the same behavior via defer().
After quite a journey, we are almost at the end. The only function we are left to implement, in order to be able to compile the code in test.c - Using the file API to create a file with a minimum size, is pal_close_file(). You can see the code in pal.linux.c - Closing a file.
pal.linux.c - Closing a file
link:./code/pal.linux.c[role=include]
The pal_close_file() function simply calls close() and adds some error handling, nothing more. We aren’t trying to call fsync() or do anything fancy at this layer. That will be the responsibility of higher tiers in the code. Note that we set up the handle→filename and the handle itself to be freed, but we aren’t calling that explicitly. I find that it is cleaner to do things this way. I don’t need to think about the error conditions that I may need to cover.
One thing deserves calling out here: an error from close() isn’t theoretical. In almost all cases, whenever you do I/O, you are not interacting with the actual hardware, but with the page cache. That means that almost all I/O is done in an asynchronous fashion, and close() is one way for you to get notified if there have been any errors.
Even when checking the return value of close(), you still need to take into account that errors will happen. Unless fsync() was called, the file system is free to take your writes to a close()d file and just throw them away. This is not a theoretical issue, by the way; it happens quite often in many failure scenarios. The recommendation from the file system mailing list is to call fsync() and then close(), to get the highest durability mode.
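In sketch form, that recommendation boils down to a hypothetical helper like this, surfacing write-back errors from both calls instead of silently dropping them:

[source,c]
----
#include <unistd.h>

static int durable_close_sketch(int fd) {
  int rc = 0;
  if (fsync(fd) == -1) rc = -1; // flush the data (and metadata) to the device
  if (close(fd) == -1) rc = -1; // close() can still report async I/O errors
  return rc;
}
----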
Now that we are able to create a file and allocate disk space for it, we need to tackle the next challenge: deciding how we are going to read from and write to this file. I’ll defer talking about the internal organization of the file to the next chapter; for now, let’s talk about the low level interface that the PAL will offer to the rest of the system.
We already saw them in gavran/pal.h - The file system interface that we’ll consume for our storage engine: pal_mmap() and pal_unmap(), as well as pal_write_file() and pal_read_file(), are the functions provided for this purpose. There isn’t much there, which is quite surprising. There is a vast difference between the performance of reading from disk (even a fast one) and reading from memory. For this reason, databases and storage engines typically spend quite a bit of time managing buffer pools and reducing the number of times they have to go to disk.
I’m going to use a really cool technique to avoid the issue entirely. By mapping the file into memory, I don’t have to write a buffer pool; I can use the page cache that already exists in the operating system. Using the system’s page cache has a lot of advantages. I ran into this idea for the first time when reading LMDB’s codebase, and it is a fundamental property of how Voron (RavenDB’s storage engine) achieves its speed. I also recommend reading the "You’re Doing It Wrong" paper by Poul-Henning Kamp, which goes into great detail about why this is a great idea.
The idea is that we’ll ask the operating system to mmap() the file into memory, and we’ll be able to access the data through direct memory access. The operating system is in charge of the page cache, getting the right data into memory, etc. That is a lot of code that we don’t have to write, code which has gone through literal decades of optimizations. In particular, the page cache implements strategies such as read-ahead, automatic eviction as needed, and many more behaviors that you would usually need to write in a buffer pool implementation. By leaning on the OS’ virtual memory manager to do all that, we gain enormous leverage.
We could map the memory for both reads and writes, but I believe that it makes more sense to map the file data only for reads. This avoids cases where we accidentally write over the file data in an unintended manner. Instead, we create an explicit call to write the data to the file: pal_write_file().
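A sketch of such a read-only mapping is shown below; because the mapping is PROT_READ only, an accidental write through it faults instead of silently corrupting the file.

[source,c]
----
#include <stddef.h>
#include <sys/mman.h>

// Map `size` bytes of the file read-only; returns NULL on failure
static void *map_file_readonly_sketch(int fd, size_t size) {
  void *address = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
  return address == MAP_FAILED ? NULL : address;
}
----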
pal_unmap() is just the other side of the pal_mmap() operation, allowing us to clean up after ourselves. I’m adding pal_read_file() for completeness’ sake, and because if we are running in 32 bits we can’t really mmap() a 10 GB file, so we also need another fallback. You can see the mapping and un-mapping code in pal.linux.c - Implementing mapping and un-mapping of memory from our data file.
What is the best option for reading and writing, then? Go memory mapped I/O all the way, or rely on read() and write()? And can we mix them? Why have two ways to go about this? I’m mostly going to be using memory mapped I/O for reads, but writes will use pwrite(). That gives me the benefit of the OS’ buffer cache and makes the best use of the system resources. Using pwrite(), on the other hand, ensures that we aren’t accidentally writing beyond the end of the buffer and will help us catch any errors.
On Linux, that is safe to do, because both mmap() and the write() call use the same page cache and are coherent with respect to one another. On Windows, on the other hand, that is not the case. Mixing file I/O calls and memory mapped files leads to situations where data written using the I/O API takes some time to become visible through the memory view. For further reading, you can see how I found out about this delightful state of affairs. When we get to the Windows side of things, we’ll show how to deal with this limitation properly.
pal.linux.c - Implementing mapping and un-mapping of memory from our data file
link:./code/pal.linux.c[role=include]
There really isn’t much in pal.linux.c - Implementing mapping and un-mapping of memory from our data file. We just need to call mmap() or munmap() and do some basic error reporting if something fails.
In pal.linux.c - Allowing to close() the file and unmap() via defer(), however, we allow closing the file and unmapping the memory via a defer() call.
pal.linux.c - Allowing to close() the file and unmap() via defer()
link:./code/pal.linux.c[role=include]
In pal.linux.c - Writing and reading data to the file we have the implementation of writing to and reading from files using normal file I/O.
The pal_write_file() call is a simple wrapper around the pwrite() call, with the only difference being that we’ll repeat the write until the entire buffer has been written. In practice, this usually means that we’ll only do a single call to pwrite(), which will perform all the work. And pal_read_file() does the same on top of pread(). The only difference there is that pal_read_file() will consider partial reads as errors, so you either get the whole buffer you asked for, or an error.
pal.linux.c - Writing and reading data to the file
link:./code/pal.linux.c[role=include]
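The core retry loop can be sketched as follows (hypothetical signature; the real function reports failures through Gavran’s error infrastructure rather than returning an int):

[source,c]
----
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

static int write_file_sketch(int fd, off_t offset,
                             const char *buffer, size_t size) {
  while (size > 0) {
    ssize_t written = pwrite(fd, buffer, size, offset);
    if (written == -1) {
      if (errno == EINTR) continue; // interrupted before writing, just retry
      return -1;
    }
    buffer += written; // a short write isn't an error, write the rest
    offset += written;
    size -= (size_t)written;
  }
  return 0;
}
----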
As much as I want to tunnel my write I/O purely through the pwrite() system call, there are going to be cases where we’ll want to enable a writable memory map, for the ease with which it lets us make changes in memory. Supporting that is quite easy using the mprotect() system call; you can see the code in pal.linux.c - Enable writable memory maps and allow to disable them.
pal.linux.c - Enable writable memory maps and allow to disable them
link:./code/pal.linux.c[role=include]
pal.linux.c - Enable writable memory maps and allow to disable them doesn’t define a pal_disable_writes() method; instead it jumps directly to the defer call. You are expected to call defer(pal_disable_writes, span); immediately after calling ensure(pal_enable_writes(span));. This API reflects the notion that writable memory maps are meant to be rare.
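Under the hood, this is little more than flipping page protections. A sketch, reusing the span_t sketch from earlier (mprotect() requires the range to be page aligned, which file mappings already are):

[source,c]
----
#include <sys/mman.h>

static int enable_writes_sketch(span_t *span) {
  return mprotect(span->address, span->size, PROT_READ | PROT_WRITE);
}

static int disable_writes_sketch(span_t *span) {
  return mprotect(span->address, span->size, PROT_READ); // back to read-only
}
----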
There is only one area of the API that we haven’t looked at yet. The pal_fsync() function is implemented just as simply as you would expect it to be. You can see it in pal.linux.c - Flushing a file.
pal.linux.c - Flushing a file
link:./code/pal.linux.c[role=include]
Actually, there is a surprise in pal.linux.c - Flushing a file: we aren’t calling fsync(), we are only calling fdatasync(). Note that elsewhere in the PAL we used fsync(), in pal.linux.c - Calling fsync() on a parent directory to ensure that the file metadata has been preserved, for example. Why this discrepancy? And why call the method pal_fsync() and not pal_fdatasync()?
When we call fsync(), we are asking to flush the file’s data to disk, as well as the file’s metadata (last modified time, access time, etc.). When we call fdatasync(), we only ask to flush the file’s data; the metadata isn’t required to be included. That makes fdatasync() cheaper for scenarios where the metadata updates aren’t important. Remember, fsync() and fdatasync() are expensive, so any reduction in cost is welcome.
I’m calling it pal_fsync() because the choice of fdatasync() is an implementation detail. On Mac, we’ll need to use fcntl(fd, F_FULLFSYNC); because fsync() doesn’t actually do what it is supposed to do. On Windows, we’ll call FlushFileBuffers(), which is the fsync() equivalent.
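Putting those platform notes into code, a sketch of the per-platform choice might look like this (the Windows FlushFileBuffers() branch would live in its own file, so only the POSIX side is shown):

[source,c]
----
#include <fcntl.h>
#include <unistd.h>

static int fsync_sketch(int fd) {
#if defined(__APPLE__)
  return fcntl(fd, F_FULLFSYNC); // fsync() alone isn't enough on a Mac
#else
  return fdatasync(fd); // data only; we don't need the metadata flushed here
#endif
}
----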
We’ll learn all about why fsync() is important to a storage engine in Chapter 8.
There is another aspect of durable writes that we have to look at. In pal.linux.c - Opening (or creating) a file I mentioned the flags argument and said that I would talk about it later. That later is now, and you can see the flags’ options in gavran/pal.h - Flags for opening a file.
gavran/pal.h - Flags for opening a file
link:../include/gavran/pal.h[role=include]
There aren’t really many options here; we have durable mode and non-durable mode, that is all. Looking at pal.linux.c - Opening (or creating) a file, we can see that if we set the flags argument of pal_create_file() to pal_file_creation_flags_durable, it will simply add the following flags to the open() call: O_DIRECT | O_DSYNC. What does this mean?
Tip: The cost of fsync()
[...]
I’m going to cover this in more detail in Chapter 8, but the general idea is that O_DSYNC tells the operating system that every write to the file needs to behave as if we had called fdatasync() after it, and O_DIRECT instructs the file system to skip all buffers and use direct I/O. Together, however, they allow us to do something far more interesting. They tell the operating system that we want to make a durable write to disk, bypassing all buffering in the middle. This is a dramatically more expensive option, and using it in this manner requires that we accept some harsh limitations on how we can actually write. All data must be page aligned, all writes must be page aligned, etc.
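To give a feel for those constraints, here is a standalone sketch of a durable, direct write. The 4096 byte alignment is an assumption for illustration; the proper values depend on the device and file system.

[source,c]
----
#define _GNU_SOURCE // O_DIRECT is a GNU extension on Linux
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int durable_direct_write_sketch(const char *path) {
  int fd = open(path, O_WRONLY | O_DIRECT | O_DSYNC);
  if (fd == -1) return -1;
  void *buffer;
  // O_DIRECT requires the buffer, size and offset to all be aligned
  if (posix_memalign(&buffer, 4096, 4096)) {
    close(fd);
    return -1;
  }
  memset(buffer, 0, 4096);
  ssize_t written = pwrite(fd, buffer, 4096, 0); // durable once it returns
  free(buffer);
  if (close(fd) == -1) return -1;
  return written == 4096 ? 0 : -1;
}
----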
We’ll be using this mode sparingly, but it is a crucial one for ensuring that we can get proper durable writes in the system. You can see how it is used in Chapter 8. For now, we’ll just use the non-durable mode everywhere.
Note: Using a block device instead of bothering with the file system
Technically speaking, the model that I intend to use will work just as well for raw block devices as it does for files. Indeed, there are some real benefits to bypassing the file system for a storage engine. What I most want from a file system, as a storage engine author, is that it gets out of my way. There are also some performance benefits: avoiding data fragmentation, the overhead of the file system, etc. That said, working with files is ever so much easier. There are commands that let you treat a file as a block device, which would allow us to switch to a real block device at a later point in time, but having direct access to the file is just too convenient to give up. However, there is another consideration to take into account. A database server can expect to have the right kind of permissions to open a raw block device, but an embedded storage engine needs to deal with the limited rights of the processes hosting it.
We are still very early on in the process, but I think that peeking at test.c - Using the file API to read & write will show you how far we have come. We are making use of all of our functions to store and read data from the file.
test.c - Using the file API to read & write
link:./code/test.c[role=include]
test.c - Using the file API to read & write showcases the benefits of defer again, which is rapidly becoming my favorite approach to dealing with error handling in this codebase. We have to deal with multiple resources, but there isn’t any jump in complexity in the function. Everything just works.
As for the actual code, we create a new database, ensure that it has the right size, map it into memory, and then write a value to it using the pal_write_file() function. We then read the value back using the memory mapped address. The code reads nearly as well as the description of the code, with little in the way of unnecessary details.
This may seem like a humble beginning, but we are currently building the foundation of our storage engine. In the next chapter, we are going to start talking about how we are going to make use of this functionality.
To make sure that we are producing good software, we need tests. Now that the chapter is over, it is time to see its tests. I’m not going to dive too deeply into the testing side of the pool, but I have found tests to be invaluable for ensuring that the software does what you think it does.
In this case, you already saw the major test cases, test.c - Using the file API to read & write and test.c - Using the file API to create a file with a minimum size, but I have a few that test more esoteric pieces of the API as well in test.c - Testing the PAL API.
In each chapter, I’m going to create a set of unit tests that will verify the functionality of the system. Each chapter concludes with a system that passes all of its (and the previous chapters’) tests. That is important to ensure that any change that I’m making isn’t going to break the system.
That said, be aware that you can spend a lot of time tying yourself to particular implementation choices using tests. Right now I’m writing fairly silly tests, mostly because I’m not going to try to test that we are writing durably to disk. That requires a lab with a UPS that I can trigger via an API and a machine that I don’t mind frying.
And now we are ready to see how we are going to be using the files.
test.c - Testing the PAL API
link:./code/test.c[role=include]