_posts/2014-02-02-The-worst-bugs.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182

---
layout: post
title: The bug that hides from breakpoints
tags: [KnightOS, kernel hacking]
---

This is the story of the most difficult bug I ever had to solve. See if you can
figure it out before the conclusion.

### Background

For some years now, I've worked on a kernel for Texas
Instruments calculators called [KnightOS](https://github.com/KnightOS/kernel).
This kernel is written entirely in assembly, and targets the old-school z80
processor from back in 1976. This classic processor was built without any
concept of protection rings. It's an 8-bit processor, with 150-some instructions
and (in this application) 32K of RAM and 32K of Flash. This stuff is so old, I
ended up writing most of the KnightOS toolchain from scratch rather than try to
get archaic assemblers and compilers running on modern systems.

When you're working in an enviornment like this, there's no seperation between
kernel and userland. All "userspace" programs run as root, and crashing the entire
system is a simple task. All the memory my kernel sets aside for the
process table, or memory ownership, file handles, stacks, any other executing
process - any program can modify this freely. Of course, we have to rely on the
userland to play nice, and it usually does. But when there are bugs, they can be a
real pain in the ass to hunt down.

### The elusive bug

The original bug report: **When running the counting demo and switching between
applications, the thread list graphics become corrupted.**

I can reproduce this problem, so I settle into my development enviornment and I
set a breakpoint near the thread list's graphical code. I fire up the emulator and
repeat the steps... but it doesn't happen. This happened consistently: **the bug
was not reproduceable when a breakpoint was set**. Keep in mind, I'm running this
in a z80 emulator, so the enviornment is supposedly no different. There's no
debugger attached here.

Though this is quite strange, I don't immediately despair. I try instead setting a
"breakpoint" by dropping an infinite loop in the code, instead of a formal
breakpoint. I figure that I can halt the program flow manually and open the
debugger to inspect the problem. However, the bug wouldn't be tamed quite so
easily. The bug was unreproducable when I had this psuedo-breakpoint in place,
too.

At this point, I started to get a little frustrated. How do I debug a problem that
disappears when you debug it? I decided to try and find out what caused it after
it had taken place, by setting the breakpoint to be hit only after the graphical
corruption happened. Here, I gained some ground. I was able to reproduce it, and
*then* halt the machine, and I could examine memory and such after the bug was
given a chance to have its way over the system.

I discovered the reason the graphics were being corrupted. The kernel kept the
length of the process table at a fixed address. The thread list, in order to draw
the list of active threads, looks to this value to determine how many threads it
should draw. Well, when the bug occured, the value was too high! The thread list
was drawing threads that did not exist, and the text rendering puked non-ASCII
characters all over the display. But why was that value being corrupted?

It was an oddly specific address to change. None of the surrounding memory was
touched. Making it even more odd was the very specific conditions this happened
under - only when the counting demo was running. I asked myself, "what makes the
counting demo unique?" It hit me after a moment of thought. The counting demo
existed to demonstrate non-supsendable threads. The kernel would stop executing
threads (or "suspend" them) when they lost focus, in an attempt to keep the
system's very limited resources available. The counting demo was marked as
non-suspendable, a feature that had been implemented a few months prior. It
showed a number on the screen that counted up forever, and the idea was that you
could go give some other application focus, come back, and the number would have
been counting up while you were away. A background task, if you will.

A more accurate description of the bug emerged: "the length of the kernel process
table gets corrupted when launching the thread list when a non-suspendable thread
is running". What followed was hours and hours of crawling through the hundreds of
lines of assembly between summoning the thread list, and actually seeing it. I'll
spare you the details, because they are very boring. We'll pick the story back up
at the point where I had isolated the area in which it occured: applib.

The KnightOS userland offered "applib", a library of common functions applications
would need to get the general UX of the system. Among these was the function
`applibGetKey`, which was a wrapper around the kernel's `getKey` function. The
idea was that it would work the same way (return the last key pressed), but for
special keys, it would do the appropriate action for you. For example, if you
pressed the F5 key, it would suspend the current thread and launch the thread
list. This is the mechanism with which most applications transfer control out of
their own thread and into the thread list.

Eager that I had found the source of the issue, I placed a breakpoint nearby. That
same issue from before struck again - the bug vanished when the breakpoint was
set. I tried a more creative approach: instead of using a proper breakpoint, I
asked the emulator to halt whenever that address was written to. Even still - the
bug hid itself whenever this happened.

I decided to dive into the kernel's getKey function. Here's the start of the
function, as it appeared at the time:

{% highlight nasm %}
getKey:
    call hasKeypadLock
    jr _
    xor a
    ret
_:  push bc
; ...
{% endhighlight %}

I started going through this code line-by-line, trying to see if there was
anything here that could concievably touch the thread table. I noticed a minor
error here, and corrected it without thinking:

{% highlight nasm %}
getKey:
    call hasKeypadLock
    jr z, _
    xor a
    ret
_:  push bc
; ...
{% endhighlight %}

The simple error I had corrected: getKey was pressing forward, even when the
current thread didn't have control of the keyboard hardware. This was a silly
error - only two characters were omitted.

A moment after I fixed that issue, the answer set in - this was the source of the
entire problem. Confirming it, I booted up the emulator with this change applied
and the bug was indeed resolved.

Can you guess what happened here? Here's the other piece of the puzzle to help you
out, translated more or less into C for readability:

{% highlight c %}
int applibGetKey() {
    int key = getKey();
    if (key == KEY_F5) {
        launch_threadlist();
        suspend_thread();
    }
    return key;
}
{% endhighlight %}

Two more details you might not have picked up on:

* applibGetKey is non-blocking
* suspend_thread suspends the current thread immediately, so it doesn't return until the
  thread resumes.

### The bug, uncovered

Here's what actually happened. For most threads (the suspendable kind), that
thread stops processing when `suspend_thread()` is called. The usually
non-blocking applibGetKey function blocks until the thread is resumed in this
scenario. However, the counting demo was *non-suspendable*. The suspend_thread
function has no effect, by design. So, suspend_thread did not block, and the
keypress was returned straight away. By this point, the thread list had launched
properly and it was given control of the keyboard.

However, the counting demo went back into its main loop, and started calling
applibGetKey again. Since the average user's finger remained pressed against the
button for a few moments more, applibGetKey *continued to launch the thread list,
over and over*. The thread list itself is a special thread, and it doesn't
actually have a user-friendly name. It was designed to ignore itself when it drew
the active threads. However, it was *not* designed to ignore other instances of
itself, the reason being that there would never be two of them running at once.
When attempting to draw these other instances, the thread list started rendering
text that wasn't there, causing the corruption.

This bug vanished whenever I set a breakpoint because it would halt the system's
keyboard processing logic. I lifted my finger from the key before allowing it to
move on.

The solution was to make the kernel's getKey function respect hardware locks by
fixing that simple, two-character typo. That way, the counting demo, which had no
right to know what keys were being pressed, would not know that they key was still
being pressed.

The debugging described by this blog post took approximately three weeks.

[Discussion on Hacker News](https://news.ycombinator.com/item?id=7688700)