Apparently this installer wants a title

Various ramblings from David Benjamin

dpkg: state of the listfile

with 3 comments

This is somewhat of an old story that happened back in the fall, but people often ask me what became of it, so it is probably worth recounting everything…

Have you ever attempted to install a new package on your favorite Debian/Ubuntu system only to find your install stalling on a curious “Reading database” step? Some time ago, I was investigating dpkg performance issues on an older implementation of Debathena’s login chroot. While the inherent problems of the previous LVM-based chroots have since been fixed with an aufs-based solution, the chief cause still remains: dpkg’s poor IO behavior.

An important job of a package manager is to answer the queries “where did this file come from?” and “what files did this package install?”. dpkg maintains this information in a very simple scheme. In /var/lib/dpkg/info (more specifically, the info subdirectory of the admindir, which may be set with the --admindir option or at compile-time) dpkg installs a listfile for every package: packagename.list. These listfiles contain a linebreak-separated list of every path owned by that package. A fairly effective format — the information needed is stored, and consistency is (in theory) easy to maintain with fsync, modulo the terrors of maintaining consistency on a filesystem — but for one flaw: no considerations whatsoever are given to the first of our two queries.

dpkg maintains no additional cache or index. Given a file, to obtain the package owning the file, dpkg loads in every listfile via ensure_packagefiles_available on every installed package and stuffs them into a giant global hash table. Once the table is built, lookup is easy, but building the table is extremely expensive. This is what “Reading database” means. In arbitrary order, dpkg sequentially reads each listfile from disk. On cold cache, this is an expensive process fraught with seeking.

To see just how expensive it is, one can experiment with dpkg -S. dpkg -S does little more than read the listfiles into memory (it implements the “given a file, find me the owning package” query), so it makes a perfect way to isolate the bottleneck.

On my machine,

# echo 3 > /proc/sys/vm/drop_caches
% time dpkg -S /bin/ls
coreutils: /bin/ls
dpkg -S /bin/ls  0.58s user 0.40s system 3% cpu 26.685 total

To prove that we are indeed bound by disk, run the command a second time on warm cache, and the results are much faster.

% time dpkg -S /bin/ls
coreutils: /bin/ls
dpkg -S /bin/ls  0.24s user 0.09s system 100% cpu 0.329 total

For so little data, this is hardly acceptable. Your typical off-the-shelf relational database will handle orders of magnitudes more data and still perform this query quickly (it’s a simple index lookup). In fact, I proposed on the list that dpkg add some form of database-backed index, or alternatively a dumber cache. (I wrote a proof-of-concept using a tar file.) Sadly, there was little interest. The internal abstractions of dpkg (what little there are) are poorly suited for a database or other index, so I would prefer not to start such a large project without the blessings of the dpkg team. (I did, however, end up getting two refactoring patches upstream in the process. Now Launchpad thinks I am active in dpkg.) Another developer had apparently in the past attempted a similar job and was shot down.

But there is still hope! A patch appeared on the dpkg developer list some time after I made my original proposal (I would like to think I had some influence in its making, but probably not) that bandaged the problem in a slightly less intrusive, but much more hackish way: perform a FIBMAP (later changed to FIEMAP) ioctl on every file and sort by first extent to obtain reading order. Of course, this actually wants asynchronous i/o, but this does not yet work to the best of my knowledge. As of writing, this patch has yet to be merged, but I expect it will be. dpkg people just merge patches very slowly.

The author claims a speedup from 27 seconds to 8 seconds on cold cache. (My original tar-based proof-of-concept manages a little over a second.) So, expect a slightly faster dpkg in the next few years. Possibly in time for M…ad Monkey? I’ll probably do a little happy post here when that finally gets merged. It somewhat saddens me that, should I ever get around to trying again, I will have a much harder time justifying the re-architecting necessary to introduce a proper index, but oh well.

And, in keeping with the previous post’s conclusion, how should this actually be done? Well, if we instead organize files under their owning package, both queries are trivial. Furthermore, the install process no longer even needs to make the once difficult query! dpkg requires the hash table of files primarily for file conflict detection, but in this setup, there is no such thing as file conflicts. It of course, opens other cans of worms, but that’s a topic for later.


Update: the FIEMAP patch has been merged. Of course, we really want proper asynchronous IO. Unfortunately, it looks like dpkg 1.15.6 includes fsync-related changes that hurt performance.

Written by davidben

February 17th, 2010 at 2:10 am

3 Responses to 'dpkg: state of the listfile'

Subscribe to comments with RSS or TrackBack to 'dpkg: state of the listfile'.

  1. Man, you remember Windows, where everything got its own C:\Program Files\FooProgram? Or better yet, back in the day when it was just C:\FOOPRO~1? Now if only it had a package manager…

    ageng

    17 Feb 10 at 2:32 am

  2. Hehe. Placing everything in C:\Program Files\FooProgram (I think it’s just Programs now in Vista) would be in some ways preferable. This structure is less strongly enforced, unlike, say, OS X app bundles, but I think it is a better organization.

    I believe Windows does have some sort of package manager that installers register their files with; certainly XP had some way of getting a list of files since it would take *forever* to compute the size of every program. My guess is that it was doing an equivalent of stat() on each flie. They also have something slightly more structured with .msi files, but I’ve gathered even less about those.

    davidben

    17 Feb 10 at 10:20 am

  3. Windows indeed has a central registry of installation-related information used for the Add/Remove Programs panel in Control Panel (now called “Programs and Features” from Vista onward, annoyingly moving it from the top of the list of panels to somewhere in the middle). However, this is just for uninstallation, and not all programs register with it.

    Jeffrey Bosboom

    1 May 10 at 4:29 pm

Leave a Reply