Archive for March, 2010
Spring break is over. As expected, I didn’t manage to finish most of my projects for the week, but I did manage one of them: my laptop is now running an install of KDE 4.4 parallel to the system KDE 4.3 provided by Kubuntu.
Why I did this was described previously. Actually managing it was not that simple. We do not live in a perfect world, and indeed it’s silly to expect all of KDE to run without any root activity (any setuid portions, or global D-Bus configuration, for instance). Still, I wanted to try. In this and the next few posts, I’ll talk about the setup.
I have been managing software out of my home directory for quite some time now. To that end, I’ve built up a collection of functions in zsh, my primary shell. (There’s no particular reason why they’re in zsh; I just prefer it to bash.) They are inspired by the software lockers of MIT’s Athena system and the runtime setup of Zero Install. At some point, I expect this system will converge to something that smells very much like part of Zero Install.
Any time I need some software which Ubuntu does not provide, I build it myself (or, if I’m lucky, find a binary to unpack), isolated somewhere in my home directory. The convention so far has been ~/pkg/PKGNAME for random software or ~/proj-build/PROJECTNAME for things I’m working on, but I’m not particularly happy with this naming scheme. (It came about mostly by accident; I’ll likely move everything into ~/pkg or something.) Every locker contains approximately a UNIX directory tree.
A set of (fairly hacky) shell functions then injects subdirectories, as appropriate, into the environment when a locker is to be added. Unlike Zero Install, the variables are not specified by the locker. Instead, the shell script looks for, e.g., bin, and adds it to, e.g., PATH, if it exists. This was mostly done out of laziness; at some point, variable choices will become the locker’s business. A handful of variables are currently handled this way.
There are two commands, borrowed from Athena, to add a locker to an environment. The first is dir_run, which runs a command with the given locker injected. The second is dir_add, which injects a locker into your current environment. I primarily use dir_run with fancy completion scripts, but my dot files dir_add any lockers which I use often or want injected into my standard environment. (A rough sketch of the idea is below.)
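To make this concrete, here is a minimal Python sketch of what a dir_run-style command does. This is illustrative only: the real versions are zsh functions, and the subdirectory-to-variable table below is a plausible guess, not the actual list.

```python
#!/usr/bin/env python
import os
import sys

# Hypothetical map of locker subdirectories to the environment variables
# they feed; the real list lives in the zsh scripts.
SUBDIR_VARS = {
    "bin": "PATH",
    "lib": "LD_LIBRARY_PATH",
    "lib/pkgconfig": "PKG_CONFIG_PATH",
    "share/man": "MANPATH",
}

def inject(locker, env):
    """Prepend each locker subdirectory onto its variable, if it exists."""
    for subdir, var in SUBDIR_VARS.items():
        path = os.path.join(locker, subdir)
        if os.path.isdir(path):  # only inject what the locker provides
            old = env.get(var)
            env[var] = path + (os.pathsep + old if old else "")
    return env

def dir_run(locker, command):
    """Run `command` with `locker` injected into its environment."""
    os.execvpe(command[0], command, inject(locker, dict(os.environ)))

if __name__ == "__main__":
    # e.g.: dir_run.py ~/pkg/git git --version
    dir_run(os.path.expanduser(sys.argv[1]), sys.argv[2:])
```

dir_add is the same idea, except the assignments must land in the current shell’s environment, which is why the real implementation has to be a shell function rather than a child process.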
So far, this setup has allowed me to run my system okular on a development build of poppler when I hack on it. It’s allowed me to maintain a local build of git. It’s allowed me to parallel-install multiple snapshots of Chromium. It’s even allowed me, via dir_add, to replace my system’s PyKerberos when a bug in the packaged version prevented system software from using it. And, indeed, it allows me to run KDE out of my home directory.
Of course, building KDE for this wasn’t simply a matter of stuffing things into a folder and launching it. There were numerous problems to address along the way, which I’ll describe in later posts.
If anyone wants my hacky scripts, they can be found in my athena Public. A disclaimer: they are hairy and very much in need of a cleanup. Also, they might need tweaks to work well in bash; zsh lets me be lazy about quoting arguments. All that said, they are sufficient for my needs and, despite being far from a true package management system, I think they are superior to anything dpkg offers when it comes to maintaining different software configurations in parallel.
It’s no secret that I am unhappy with package management on Linux. One of these days, I’ll gather coherent enough thoughts on how things should work. In the meantime, here’s a glimpse at one of the biggest problems today.
If you look at the package management stacks in use on Linux today, be it apt/dpkg or yum/rpm or whatever, they share a fundamental assumption: there will only ever be one version of any package on the system. I argue that this mode of thinking is simply incorrect for a package manager on the free desktop. We need a package manager which fundamentally assumes parallel installation of packages. While correct parallel-install semantics are difficult, the flaws parallel installation fixes are well worth the effort.
One important use for parallel installation is testing. Users on any platform run a wide variety of configurations, and software should be tested against more than one of them. One possibility is to use a separate machine, but this is painful. Indeed, Microsoft has not solved this problem; web designers wishing to run IE 6 and 7 concurrently were advised to use a virtual machine.
When I used Windows, I used Portable Firefox for this. On Linux, I similarly download the official tarballs and run them out of my home directory, taking care not to eat my profile. But why should I manually manage this when I have a state-of-the-art (if the rumors on every Ubuntu advocate’s top-10 list are true) package manager on my system?
Related to the needs of testing environments is the ability to fall back when software breaks. For instance, my current browser of choice is Chrome. Now, Google provides an apt repository for Chrome, and yet I use the Chromium nightlies. The apt repositories force me into a single-install setup. Chrome is a very fast-moving target, and things sometimes break. Yet I appreciate the pace, as features I require, such as client-side certificates, get added quickly. There is a simple solution with parallel installation: I keep my old version around when updating to a new build. If the new one proves unstable, I just revert to using the old one.
(These days things are less unstable than before. Should YouTube’s HTML5 video fix its quality problems, I’ll likely start using Chrome proper. Of course, my Chromium setup still parallel-installs, so I can roll back at will.)
But why bother? Couldn’t I just uninstall the new version and install the old one? In practice, this doesn’t work. A month or so ago, KDE 4.4 was released. I, being the avid KDE user that I am, was eager to try it. Well, Kubuntu offered backported packages… why not? I could always go back to 4.3 if I wanted, right? To make a long story short: no. When I rolled back, dpkg and apt got woefully confused, and I ended up reinstalling most of the software on my machine. I am now in the process of creating a KDE 4.4 install to run out of my home directory. When the last few remaining kinks are ironed out, I’ll describe the setup.
As they say, the best code is code you don’t have to write. Likewise, the best rollback procedure is one you don’t have to perform.
Finally, parallel installation acknowledges a fundamental fact of library compatibility: no two versions of a library are ever completely interchangeable. Distributions love to force every package to use the same copy of every library. Most of the time, this is a sound and sensible goal, but it often falls short of reality. Even if the author of a library is very careful about keeping the API and ABI working, programs may depend on subtle effects.
Take, for instance, this hypothetical situation: libfoo has a bug which causes some functionality of bar to fail. Bar’s developers eventually diagnose this, and perhaps even send a patch to libfoo. In the meantime, bar should still work, so they add a workaround for the bug. The workaround is, sadly, not compatible with the fix, so it is conditionalized on libfoo’s version. Then a distributor comes along, packages an older libfoo for stability, but backports the fix; now bar fails to work on that distribution. Think I’m exaggerating? Search for “Debian” in these Eclipse release notes. Indeed, their solution is to install a different version of GTK+. Would it not be better if we could parallel-install GTK+ and use the specially crafted one only for Eclipse?
Sometimes a package may even be incompatible with itself. SQLite has countless incompatible build options. The only possible solution is for every program to bundle its own SQLite in parallel.
Linux package managers of today are inadequate for supporting a platform for developers, content producers, and users. We need package managers which allow as much of the system as possible to be parallel-installed to support the evolving, disorganized nature of the Linux desktop.
Often when solving a problem, a useful approach is to do some local work which reduces the problem to a smaller instance. Sometimes this smaller instance can be complex to build. In the case of Kruskal’s algorithm for finding a minimum spanning tree, we pick edges and recurse over a reduced graph. Specifically, we identify two vertices as “the same”, merge them together, and continue. We could literally merge the vertices, but if vertices have high degree or the wrong edges are picked, it is easy to do a needless O(N²) of work or the like. We would like to do this more efficiently.
Having a dynamic notion of “sameness” is common in many problems. Quite conveniently, there is a data structure that allows one to track this efficiently, called union-find or the disjoint-set data structure.
The spec of union-find is as follows: initially, we have N objects and N buckets, with object i in bucket i. We then provide two operations: find, which takes an object and returns the bucket it lives in, and union, which takes two buckets and merges them into one. Implementing this efficiently isn’t trivial. Naively, union must alter the state of many objects at once. To resolve this, we do what is often done in data structures and lazily postpone work from write time to read time.
For every object, we create a node with a single pointer coming out. This pointer means “I am in the same bucket as”. A self pointer means that this object is also a bucket; that is, it is the representative element of its bucket. Initially, every pointer is a self loop, so each object is its own bucket.
To union buckets, we take their representative elements and hang one under the other. For instance, if we were to union 1 with 2, and then 3 and 5 with 4, we might end up with 2 hanging under 1, and 3 and 5 hanging under 4.
Note that these are trees, but with only parent pointers. To find, we just walk up the structure until we reach a bucket and return it. However, we’ve perhaps offloaded too much work onto read time: it is very easy to reach a degenerate configuration in which the nodes form one arbitrarily long chain, say 5 → 4 → 3 → 2 → 1.
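As a minimal illustrative sketch (Python here, though any language works), the whole structure so far is just an array of parent pointers:

```python
# Naive union-find: parent[i] == i means object i is the representative
# (the "bucket") of its set.
parent = list(range(10))  # objects 0..9, each initially its own bucket

def find(x):
    """Walk up parent pointers until we hit a self loop."""
    while parent[x] != x:
        x = parent[x]
    return x

def union(a, b):
    """Hang one bucket's representative under the other's."""
    parent[find(b)] = find(a)
```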
A find of 5 will take linear time, which is unacceptable; the whole point of this exercise is to avoid linear queries. However, we may add a simple optimization. When unioning two trees, we have a choice: either the first hangs below the second or vice versa. It is better to hang the shorter one underneath the taller one, to avoid increasing the height. (If the two trees are of equal height, the height must increase by 1, so we pick arbitrarily.) To manage this, we do a little book-keeping and maintain the height of each node’s subtree. This is called union by rank. It turns out that this alone will guarantee balanced trees: each operation will run in time O(lg N). That’s already pretty good, but we can do better!
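Continuing the sketch above, union by rank is a few extra lines of book-keeping:

```python
rank = [0] * len(parent)  # upper bound on the height of each tree

def union(a, b):
    """Merge the buckets of a and b, hanging the shorter tree under the
    taller one so heights grow as slowly as possible."""
    a, b = find(a), find(b)
    if a == b:
        return  # already the same bucket
    if rank[a] < rank[b]:
        a, b = b, a  # make a the taller tree
    parent[b] = a
    if rank[a] == rank[b]:
        rank[a] += 1  # equal heights: the merged tree grows by one
```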
Consider a fairly expensive find: say find(5) walks up from 5 through 4 and a couple of intermediate nodes before finally reaching the root 1, with a few other nodes (7, 8, and 9) hanging off the nodes along the path.
So, now we’ve learned that find(5) = 1. Now, I ask you this again. And again. And again. Each time, we walk up the tree. Why bother, when you can cache? We can just repoint 5 from 4 all the way to 1. But why stop there? We traversed 4 as well, so why not rewrite its pointer too?
By rewriting every edge along any path we walk, we never traverse a path twice. By paying for one expensive find of 5, we speed up future queries for everyone on the path. Furthermore, queries for anyone underneath those nodes (7, 8, and 9) get sped up too.
This optimization is called path compression. By a fairly involved proof, one can show that, using path compression alone, we obtain amortized O(lg N) performance.
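Continuing the sketch, a find with path compression can be written in two passes: one to locate the root, and one to repoint everything at it.

```python
def find(x):
    """Find x's root, repointing every node on the path directly at it."""
    root = x
    while parent[root] != root:
        root = parent[root]  # first pass: locate the root
    while parent[x] != root:
        parent[x], x = root, parent[x]  # second pass: compress the path
    return root
```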
So what if we put them together? With both path compression and union by rank, we bring the time down to O(α(N)), where α is the inverse of the Ackermann function. The Ackermann function is ridiculously fast-growing: A(4, 4) is 2^(2^65,536), which is far bigger than any number you will care about, so α(N) is at most 4 for any input you will ever see. Furthermore, you can prove that any implementation must spend Ω(α(N)) amortized time per operation. So not only is this fancy expression for what is effectively O(4) achievable, it is optimal. It’s also extremely simple; a couple dozen lines of your favorite language will suffice.
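To back that last claim up, here is a self-contained sketch of the full structure (again, illustrative code of my own, in Python), with a toy Kruskal’s algorithm from the introduction as the usage example:

```python
class DisjointSet:
    """Union-find with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        if self.parent[x] != x:
            # Recursive path compression: point x straight at the root.
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        """Merge a's and b's buckets; return False if already merged."""
        a, b = self.find(a), self.find(b)
        if a == b:
            return False
        if self.rank[a] < self.rank[b]:
            a, b = b, a
        self.parent[b] = a
        if self.rank[a] == self.rank[b]:
            self.rank[a] += 1
        return True

def kruskal(n, edges):
    """Kruskal's MST: scan edges by weight, keeping each edge whose
    endpoints are not yet in the same bucket."""
    ds = DisjointSet(n)
    mst = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if ds.union(u, v):  # keep the edge only if it joins two buckets
            mst.append((u, v, w))
    return mst

# A square 0-1-2-3 with one heavy diagonal.
print(kruskal(4, [(0, 1, 1), (1, 2, 2), (2, 3, 1), (3, 0, 3), (0, 2, 5)]))
# -> [(0, 1, 1), (2, 3, 1), (1, 2, 2)]
```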