Storing a Million Files in a Tiny Box: An Introduction to Filesystems
One of the more common pieces of idle chit-chat people share when attempting to make small talk is the simple-seeming question: So what do you do for a living? Those of you who also work in software will know how quickly answers to this question can go awry when talking to people who aren’t technologically savvy. Recently, I was trying to explain what I’ve been working on lately to my mother. She’s a smart woman, but she has relatively little background in computers and how they work. This is a slightly better version of the analogy I came up with:
The filesystem is code that tells your computer how to store all your files. Think of your computer like a vast1 library. The books are your files on your computer. The shelves and cases and stacks where they’re kept in the library are your hard drive. The filesystem, then, is the team of very helpful librarians who run the place.
They have five basic jobs:
- Lookup: finding a given book in the library, or know if it’s in the library at all.
- Reading: fetching the book from a given place.
- Allocation: locating a place for any new books they’re given.
- Writing: putting a book in a given place on the shelves.
- Administration: opening and closing the library; talking to patrons (programs wanting to use your files); transmitting information when everything catches fire.
This, as it turns out, is not a bad analogy because filesystems and libraries have a lot in common. They store vast amounts of data, they’re responsible for handing it out and taking care of it, and they need to be relatively efficient about the whole process. Today, we’re going to talk about how filesystems do allocation, what the goals and constraints are, what some of the common solutions are, and how we can make things even better.
In the library-filesystem analogy, a file is a book. Even if it contains a picture, or a song, or a gif of a cat failing to jump onto something, it’s a book. Just like books, files are made of pages2 (though they’re usually called blocks or sectors). Some files are just one page; others can be many millions. Unlike normal books, files don’t have to stay in one piece when you’re storing them; you can put the first ten pages on one shelf, and then the next ten somewhere completely different if you need to. Of course, if someone asks for the first twenty pages of the book, you’ll have to do some extra walking to fetch them from both places, so it’s usually a good idea to try to keep the pages together as much as possible.
In addition, unlike a library, books are constantly being added and rewritten, so it’s very important that the librarians be quickly able to find a place for any new or updated books. The simplest way to find a new place for a book is to just wander around until you find a place to put it. Unfortunately, modern hard drives are so big that this would be extremely slow3. What we need is some way to keep track of what space is in use and what space is free that we can use to quickly find a good spot to put a new file in. We’ll call this idea a space map.
A space map is a special file in the filesystem that keeps track of what space is in use and what space is available. There are many ways to organize the data in this file to be as compact as possible while still being easy to use. Some modern filesystem store a bitmap; each spot on the shelf is represented by a single 1 or 0—a 1 means it’s in use, and a 0 means it’s free. Since you only need a single bit for each page, you can store a lot of map information in a relatively small file.
However, when you need to update the space map with new information, the updates will tend to be randomly scattered across the whole file, which causes performance problems. Another option is to store a list of all the allocated spaces. This works great if you tend to have long stretches of allocated or unallocated space since (almost) no matter how large the contiguous chunk is, it can be stored in a single entry. It works less well if the average chunk size is extremely small. This still has the problem with updates causing a lot of randomly placed writes within the space map.
This problem can be solved for both methods with something called a “log-structured space map.” The idea is that you store the normal space map, and then at the end of it you record all the changes you want to make to it in a log. When the log gets to a certain size, you combine it with the normal space map and rewrite the map all at once. This way, all your updates are usually in a few blocks at the end of the file—except every so often when you go through and update the whole map.
These maps work great in theory, but having to go to the shelves to look up where there are places to put things every time you need to do an allocation can get very slow. This is where the system’s main memory comes into play, with in-memory space map representations.
If the shelves of the library are your computer’s hard drive, the librarians’ desks are the memory in use by the filesystem. This space can be used to store data that’s being accessed frequently, or store important documents for the librarians themselves. One such document could be the space map. However, even as efficient as these files are, the space map for a whole library is usually too large to conveniently keep in memory. As a result, most filesystems break up the space map into chunks, each representing a different part of the hard drive.
Think of it like having a space map for each row of bookshelves. It’s still useful to have one, but it’s of a manageable size. However, we still need to read through the whole space map looking for a good spot, and the whole log as well. We need a more efficient way to store the map while it’s in memory. We’re ok doing lots of random updates since we don’t have to travel very far to do them, but we don’t want to have to take time proportional to the size of the whole map to look for a place to put data.
One solution to this problem is called a range tree. This is a data structure that stores a list of all the free spaces on the disk in a quick and easy way to access and update. You can also extend them to make them very good at finding free spaces of a specific size, so you can quickly find a place that your book will fit into.
Putting it all Together
Let’s walk through a typical allocation in a filesystem using the information we’ve learned.
- First, we go through each of our loaded range trees looking for a space large enough to store the allocation. If we don’t find one, we load a new range tree by fetching the space map from the disk, and then build it into a range tree.
- We repeat this over and over until we find a place to put the allocation or determine that there is no place big enough. If there’s no big enough place, we can split the allocation into smaller chunks and start from the top.
- Once we have our place to put the allocation, we remove that space from the appropriate range tree and write out that change to the space map for this part of the disk4.
You probably have a few lingering unanswered questions, such as:
- When we split our space map into chunks, didn’t we reintroduce the problem of having to do lots of writes to different places that we tried to fix with the space map log?
- How do you pick which chunks to consider first for allocation, and how do you pick which ones to load first?
- How do you efficiently store and manage your in-memory range trees?
For more technical details about how this applies to ZFS, register for the 2019 OpenZFS Developer Summit and come check out my talk, or watch it online!
1 Vast is the right word; if each page of your average book stores 2000 letters on it, and we think of each letter as being the same as a byte of computer data, then a .1mm thick page stores about 4000 bytes of data (conveniently, almost the same as a modern disk sector!). That means you need about 17 miles of bookshelf to store as much data as a modest 1 TiB hard drive. Even with 6 rows of books on a bookshelf, with walking space between the shelves it takes up an amount of space pretty close to an american football field. And that’s for a single small hard drive!
2 Books have covers, in addition to just pages. You can extend the analogy to give files “covers”, but it’s not especially helpful and just distracts from the goal here, which is to talk about allocation.
3 Plus, in a filesystem, there is no simple way to accurately say “this block isn’t in use” just by looking at it; a block filled with zeroes might have been written that way, and changes to it would affect the file it’s a part of. There are workarounds for this, but they require some extra work, and none of them solve the “it would be really slow” problem.
4 Careful readers might note that this step may require another allocation (if we need a new block in the space map log)! Thankfully this is relatively rare, since a single block of the space map log can store a large number of updates. Even if we do need a new block to write this allocation into, that allocation will again probably fit nicely into an existing block, and so on.