Archive for May, 2010
KDE SVN Access
As of some time last week, I am now a KDE committer. So far, I haven’t made a useful commit yet, but hopefully I’ll have time over the summer. KHelpCenter could use a lot of love, and I’d like to improve the startup time of various applications.
BarnOwl, ncurses, and terminal resizing
Earlier this week, I hunted down a delicious little bug in BarnOwl. A friend of mine, Alex, reported occasionally segmentation faults in his sessions. This week, we finally got a core dump to inspect.
Off-screen windows
Inspecting the backtrace, we see the crash occurred in the ncurses function wredrawln, called by our owl_function_full_redisplay. (Actually, we call redrawwin which is a macro wrapper over wredrawln.)
Now, one of many causes for crashes in ncurses is off-screen windows. It’s not clear what the state of support for it is, as the code does check for in some cases, and newwin will allow you to create them. That said, you cannot move them off-screen with mvwin, and parts of ncurses may crash.
In fact, owl_function_full_redisplay had the following code in place
/* Work around curses segfualts with windows off the screen */
if (g.lines >= owl_global_get_typwin_lines(&g)+2)
redrawwin(owl_global_get_curs_typwin(&g));
Furthermore, it was that call to redrawwin that crashed. (The typwin is the typing area; the name is a holdover from the Owl days.) Perhaps the check wasn’t sufficient and we got an off-screen window again. So, some gdb dances later:
(gdb) p g.lines
$1 = 45
[...]
(gdb) print *(g->typpan->win)
$5 = {_cury = 0, _curx = 0, _maxy = 7, _maxx = 176, _begy = 37, _begx = 0, [...]}
(g holds much of BarnOwl’s state. Again, a remnant from Owl.) Well, 7 + 37 = 44 = 45 - 1, so that looks fine. But wait!
(gdb) print *((WINDOW*) curscr)
$2 = {_cury = 37, _curx = 0, _maxy = 43, _maxx = 176, _begy = 0, _begx = 0, [...]}
Ncurses thinks the height of the window is 43 + 1 = 44, one less than BarnOwl’s 45. So the window is off-screen. That triggers the aforementioned wredrawln crash.
Diving into ncurses
Clearly, this crash in ncurses shouldn’t happen in the first place. So let’s take a brief foray into the tangled forests of ncurses source. The function in question is implemented in ncurses/base/lib_redrawln.c. Here is a snippet of it:
end = beg + num;
if (end > curscr->_maxy + 1)
end = curscr->_maxy + 1;
if (end > win->_maxy + 1)
end = win->_maxy + 1;
len = (win->_maxx + 1);
if (len > (size_t) (curscr->_maxx + 1))
len = (size_t) (curscr->_maxx + 1);
len *= sizeof(curscr->_line[0].text[0]);
for (i = beg; i < end; i++) {
int crow = i + win->_begy;
memset(curscr->_line[crow].text + win->_begx, 0, len);
_nc_make_oldhash(crow);
}
Well, that’s odd. The library is already careful to clamp the dimensions by the screen size, so we shouldn’t have a problem. But win->_beg{x,y} is added as an offset. Those checks only work on windows flush with the upper-left corner. The typwin is not, so they fail.
Ncurses resizing
Ncurses problems not withstanding, we’re not finished. The main problem is yet to be addressed: How did BarnOwl and ncurses disagree on the window size? Before that, let’s discuss how ncurses resize is handled. There is an ioctl for querying the size of a terminal, TIOCGWINSZ (as well as a handful of other mechanisms; the ncurses source code tries a ton of different ones). But we shouldn’t poll on that value, so there is a signal SIGWINCH which informs a process of a changed terminal size.
How do you react to a SIGWINCH? The traditional way was to do a endwin/doupdate pair. This is flickery, so there is an extension to curses, resizeterm which forces the resize work directly. (In ncurses, the former method is implemented using resize_term.) If you don’t register your own SIGWINCH handler, ncurses will do so on initscr, but then it’s very hard to atomically relayout and resize. (Ncurses doesn’t know enough about the application to handle resizes in full: a textbook violation of the end-to-end principle.)
Reproducing the bug
Most people run BarnOwl inside a screen session, so let’s (incorrectly) guess that screen was failing to send the signal on attach. Aided by the backtrace, we open BarnOwl in screen with a popup open. We then detach, resize the terminal, reattach, close the popup, and BOOM! It crashes! That was easy.
Unfortunately, none of the other developers were able to reproduce this. Also, this theory doesn’t account for ncurses and BarnOwl getting different values as the same signal handler updates both. Furthermore, the bug isn’t always hit, so there’s a race condition happening somewhere.
Tracing
Stepping back a bit, there are three sets of window sizes here. The first is the actual terminal size. Then, we have ncurses’ record of the size (from which the size of the screen buffer is determined). Finally, we have BarnOwl’s own record of the size (from which the window layout is determined). Playing with signals will allow us to synchronize the first with the other two. The crash we’re interested in comes from a discrepancy between between the latter. But BarnOwl is the only one calling resizeterm, so how can a discrepancy arise?
Quite conveniently, ncurses has a trace feature which is indispensable in debugging ncurses applications. By examining trace data, we notice something strange: when the bug is triggered BarnOwl runs its SIGWINCH handler, but ncurses enters resizeterm twice.
So who made the second call. If you were reading carefully, you may already know the answer: doupdate. BarnOwl’s resize code not only called a resizeterm, but it also calls endwin.
if (!isendwin()) {
endwin();
}
/* get the new size */
ioctl(STDIN_FILENO, TIOCGWINSZ, &size);
// [...]
#ifdef HAVE_RESIZETERM
resizeterm(size.ws_row, size.ws_col);
#endif
Because we call an endwin, the next screen update (the doupdate at the end of the event loop iteration) triggers ncurses’ internal resize routine. It then does its own ioctl and finds the size. Now, if we change terminal size twice in a row, such that the second happens while we are still reacting to the first, ncurses’ later ioctl will return different values than BarnOwl’s query. As a result, we get our inconsistency.
Where does this quick resize come from? Screen has this feature to enable a status bar. Both Alex and I happen to have them enabled. Screen has (arguably) a bug where it resizes the window twice in quick succession; the second time to add a status bar. This also explains why the numbers of lines were off by exactly 1. And that is final piece of the puzzle.
Summary
To help put these pieces together, I made a diagram. A fairly complex interaction between BarnOwl, ncurses, and screen was required here.
Closing a popup window eventually calls redrawwin on a number of windows.
This call is rather pointless. In BarnOwl 1.6, I landed some code to use the libpanel library to manage overlapping windows. The much safer touchwin can be used instead of redrawwin, and owl_popwin_down has no need to repaint the screen anyway.
wredrawln checks boundaries incorrectly
I sent a patch upstream to correct this. It has been incorporated into ncurses-5.7-20100501.
Screen sends two SIGWINCHs in succession
Screen’s code is really scary. I’m not touching that one.
BarnOwl’s resize handler calls both endwin and resizeterm
This one is actually partially my fault. As part of the libpanel changes, I removed many of the extraneous explicit screen updates, including a refresh right after the endwin. This has now been changed. We no longer call endwin at all and require resizeterm. (As we already required ncurses for Unicode, this is not actually a change in dependencies.) We should also finally have flicker-free resizing now.
Had even one of these bugs not occurred, it is quite likely that we would never have noticed. But they all did and meshed together to form a crash. Some failures are simple and fairly easy to debug. Some are not. Sometimes, you have to dig quite far to understand the motions of all the players on the board.