Mind the Small Stuff

I love audiobooks, and one I’ve been listening to lately is The Meaning of It All by Richard Feynman. In it, Feynman describes an observation that relativity accounts for but Newtonian mechanics does not. (I think he means the difference between invariant and relativistic mass for a very rapidly spinning top.) Feynman says that a top spinning at relativistic speeds would be observed to weigh more than the same top at rest. This doesn’t accord with Newton’s rules, so it suggests that the rules are wrong:

It turns out that the tiny effects that turn up always require the most revolutionary modifications of ideas.

(source.)

I think that this observation suggests something we should remember in software: Small bugs might be big problems. That nagging feeling when Method A really shouldn’t be able to return a negative number, but it did, just that once? Don’t ignore that. It might indicate a much larger problem.

Once I worked on a project where, once in a blue moon, a cache of records on the server would start throwing NullReferenceExceptions, and we would need to dump the cache. Then it would be fine for weeks, maybe a month, then it would start throwing exceptions again. We were surprised, because nothing ever inserted null into the list.

Since the .NET Framework is open source now, I went digging in the System.Collections.Generic.List<T> source to see what I could see.

public void Add(T item) {
    if (_size == _items.Length) EnsureCapacity(_size + 1);
    _items[_size++] = item;
    _version++;
}

This list is implemented as a resizing array: keep a private T[] array; when someone adds an item, put it in the first unused slot and increment a counter that tracks where the first unused slot is. If the array gets full, copy the contents to a bigger array.
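As a rough sketch of that idea, here is a stripped-down resizing array. TinyList<T>, its starting capacity of 4, and the doubling rule are all made up for illustration; this is not the framework's code, which also tracks a version counter and grows through EnsureCapacity.

```csharp
using System;

// Hypothetical, simplified stand-in for List<T>; not the framework's implementation.
public class TinyList<T>
{
    private T[] _items = new T[4]; // backing array; 4 is an arbitrary starting capacity
    private int _size;             // slots in use, which is also the index of the first free slot

    public int Count => _size;

    public T this[int index]
    {
        get
        {
            if (index < 0 || index >= _size)
                throw new ArgumentOutOfRangeException(nameof(index));
            return _items[index];
        }
    }

    public void Add(T item)
    {
        if (_size == _items.Length)
        {
            // Full: copy everything into an array twice as large.
            var bigger = new T[_items.Length * 2];
            Array.Copy(_items, bigger, _size);
            _items = bigger;
        }

        // Read the index, bump the counter, store the item: three steps, not one.
        _items[_size++] = item;
    }
}
```

Nothing in Add() is synchronized, so the class is only safe when a single thread touches it at a time.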

We found that the bug was basically this: Two threads tried to insert into the list at the same time, which we thought was impossible. The statement _items[_size++] = item is not atomic, so both threads can read the same value of _size as their slot index, both increment _size, and both write to that one slot, the second overwriting the first. The result is that _size has grown by two but only one slot was filled, so the slot after it still holds null. The next consumer to iterate the (supposedly null-free) list would throw a NullReferenceException.
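To make the interleaving concrete, here is a single-threaded replay of one way two overlapping Add() calls can leave a null slot. RaceReplay and the "from A" / "from B" values are invented for this sketch, and it assumes both increments land while both threads still hold the old index; whether that happens in practice depends on how the runtime compiles and schedules the code.

```csharp
using System;

public static class RaceReplay
{
    // Replays, on one thread, one possible interleaving of two Add() calls.
    // Returns the backing array and the final value of the size counter.
    public static (string[] items, int size) Run()
    {
        var items = new string[4];
        int size = 0;

        int indexA = size;        // thread A reads _size as its slot index (0)
        int indexB = size;        // thread B reads the same value before A's increment is visible (0)
        size++;                   // A's increment
        size++;                   // B's increment
        items[indexA] = "from A"; // A stores into slot 0
        items[indexB] = "from B"; // B overwrites slot 0; slot 1 is never written

        return (items, size);
    }
}
```

After Run(), the size is 2, slot 0 holds "from B", and slot 1 is null even though nobody ever added a null.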

The lazy way to fix this is to just put checks for nullity in the consumers of our list. That would be the equivalent of adding “But spinning tops are special at high speeds” to the end of Newtonian physics; it wouldn’t address the real problem. The code was trying to tell us something, and to ignore it would be incorrect.

The actual problem was our failure to use a thread-safe collection in a situation where more than one thread could get at the collection. By using a thread-safe collection, we could address the real problem, which was our bad assumption about which thread could get where.
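For a sense of what that can look like, here is a minimal sketch using ConcurrentQueue<T> from System.Collections.Concurrent. The choice of ConcurrentQueue<T>, the SafeCache name, and the "record N" strings are assumptions for illustration; which collection fits depends on how the cache is read and written.

```csharp
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class SafeCache
{
    // Fills a cache from many threads at once and returns it.
    // ConcurrentQueue<T> is an assumption made for this sketch.
    public static ConcurrentQueue<string> Fill(int count)
    {
        var cache = new ConcurrentQueue<string>();
        // Enqueue is safe to call from several threads at the same time.
        Parallel.For(0, count, i => cache.Enqueue("record " + i));
        return cache;
    }

    // True if any entry is null; with concurrent enqueues of non-null items this stays false.
    public static bool HasNulls(ConcurrentQueue<string> cache) => cache.Any(r => r == null);
}
```

With this in place, every item enqueued is present exactly once and no slot is left unassigned, however the threads interleave.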

I think the lesson is this: In programming, as in physics, the weird little things on the edges, the corner cases and once-a-month bugs are trying to tell us something. We should listen to them.

Till next time, happy learning!

-Will
