This is somewhat of an old story that happened back in the fall, but people often ask me what became of it, so it is probably worth recounting everything…
Have you ever attempted to install a new package on your favorite Debian/Ubuntu system only to find your install stalling on a curious “Reading database” step? Some time ago, I was investigating dpkg performance issues on an older implementation of Debathena’s login chroot. While the inherent problems of the previous LVM-based chroots have since been fixed with an aufs-based solution, the chief cause still remains: dpkg’s poor IO behavior.
An important job of a package manager is to answer the queries “where did this file come from?” and “what files did this package install?”. dpkg maintains this information in a very simple scheme. In
/var/lib/dpkg/info (more specifically, the
info subdirectory of the admindir, which may be set with the
--admindir option or at compile-time) dpkg installs a listfile for every package:
packagename.list. These listfiles contain a linebreak-separated list of every path owned by that package. A fairly effective format — the information needed is stored, and consistency is (in theory) easy to maintain with
fsync, modulo the terrors of maintaining consistency on a filesystem — but for one flaw: no considerations whatsoever are given to the first of our two queries.
dpkg maintains no additional cache or index. Given a file, to obtain the package owning the file, dpkg loads in every listfile via
ensure_packagefiles_available on every installed package and stuffs them into a giant global hash table. Once the table is built, lookup is easy, but building the table is extremely expensive. This is what “Reading database” means. In arbitrary order, dpkg sequentially reads each listfile from disk. On cold cache, this is an expensive process fraught with seeking.
To see just how expensive it is, one can experiment with
dpkg -S does little more than read the listfiles into memory (it implements the “given a file, find me the owning package” query), so it makes a perfect way to isolate the bottleneck.
On my machine,
# echo 3 > /proc/sys/vm/drop_caches % time dpkg -S /bin/ls coreutils: /bin/ls dpkg -S /bin/ls 0.58s user 0.40s system 3% cpu 26.685 total
To prove that we are indeed bound by disk, run the command a second time on warm cache, and the results are much faster.
% time dpkg -S /bin/ls coreutils: /bin/ls dpkg -S /bin/ls 0.24s user 0.09s system 100% cpu 0.329 total
For so little data, this is hardly acceptable. Your typical off-the-shelf relational database will handle orders of magnitudes more data and still perform this query quickly (it’s a simple index lookup). In fact, I proposed on the list that dpkg add some form of database-backed index, or alternatively a dumber cache. (I wrote a proof-of-concept using a tar file.) Sadly, there was little interest. The internal abstractions of dpkg (what little there are) are poorly suited for a database or other index, so I would prefer not to start such a large project without the blessings of the dpkg team. (I did, however, end up getting two refactoring patches upstream in the process. Now Launchpad thinks I am active in dpkg.) Another developer had apparently in the past attempted a similar job and was shot down.
But there is still hope! A patch appeared on the dpkg developer list some time after I made my original proposal (I would like to think I had some influence in its making, but probably not) that bandaged the problem in a slightly less intrusive, but much more hackish way: perform a
FIBMAP (later changed to
FIEMAP) ioctl on every file and sort by first extent to obtain reading order. Of course, this actually wants asynchronous i/o, but this does not yet work to the best of my knowledge. As of writing, this patch has yet to be merged, but I expect it will be. dpkg people just merge patches very slowly.
The author claims a speedup from 27 seconds to 8 seconds on cold cache. (My original tar-based proof-of-concept manages a little over a second.) So, expect a slightly faster dpkg in the next few years. Possibly in time for M…ad Monkey? I’ll probably do a little happy post here when that finally gets merged. It somewhat saddens me that, should I ever get around to trying again, I will have a much harder time justifying the re-architecting necessary to introduce a proper index, but oh well.
And, in keeping with the previous post’s conclusion, how should this actually be done? Well, if we instead organize files under their owning package, both queries are trivial. Furthermore, the install process no longer even needs to make the once difficult query! dpkg requires the hash table of files primarily for file conflict detection, but in this setup, there is no such thing as file conflicts. It of course, opens other cans of worms, but that’s a topic for later.