[Photo: Taiichi Ohno (via Toyota Europe)]
Tracking down and fixing bugs is hard. A non-developer might assume the opposite — if there’s an issue that the client/end user sees, the developer just needs to “go in and fix it.” The reality, though, is that the process of isolating the actual issue that needs correcting can often be rather winding — full of dead ends and false starts.
In fact, if crafting an elegant solution to a complex problem is the pinnacle of programming joy, then trying to fix a bug that makes no sense is probably the low point. Andrew Hunt and David Thomas devote a whole chapter to the art of debugging in their book, The Pragmatic Programmer. They offer perhaps the best explanation I’ve heard for why this skill eludes so many very smart and talented people:
Before you start debugging, it’s important to adopt the right mindset. You need to turn off many of the defenses you use each day to protect your ego, tune out any project pressures you may be under, and get yourself comfortable.
It’s easy to get into a panic, especially if you are facing a deadline, or have a nervous boss or client breathing down your neck while you are trying to find the cause of the bug. But it is very important to step back a pace, and actually think about what could be causing the symptoms that you believe indicate a bug.
If your first reaction on witnessing a bug or seeing a bug report is “that’s impossible,” you are plainly wrong. Don’t waste a single neuron on the train of thought that begins “but that can’t happen” because quite clearly it can, and has.
Beware of myopia when debugging. Resist the urge to fix just the symptoms you see: it is more likely that the actual fault may be several steps removed from what you are observing, and may involve a number of other related things. Always try to discover the root cause of a problem, not just this particular appearance of it.
Take a deep breath and THINK! about what could be causing the bug.
Taiichi Ohno and the Five Whys
All of this brings me to Taiichi Ohno, an executive at the Toyota Motor Company in the years following the Second World War. Those familiar with the philosophy of the Toyota Production System will know him as the pioneer of many aspects of Kaizen, the system of continuously improving manufacturing processes, particularly the principle of “5 Whys.” As he described it, “Observe the production floor without preconceptions. Ask ‘why’ five times about every matter.”
Wikipedia gives a hypothetical automotive problem, which looks like this:
- The vehicle will not start. (the problem)
- Why? – The battery is dead. (first why)
- Why? – The alternator is not functioning. (second why)
- Why? – The alternator belt has broken. (third why)
- Why? – The alternator belt was well beyond its useful service life and not replaced. (fourth why)
- Why? – The vehicle was not maintained according to the recommended service schedule. (fifth why, a root cause)
- Why? – Replacement parts are not available because of the extreme age of the vehicle. (sixth why)
- Start maintaining the vehicle according to the recommended service schedule. (possible fix for the fifth why)
- Adapt a similar car part to the car. (possible fix for the sixth why)
The Toyota Production System has served as the foundation for many modern business philosophies, perhaps most notably Six Sigma, a set of manufacturing methodologies first developed by Motorola. My interest, though, is in how closely this process of constantly asking “why” mirrors the process that developers can and should go through whenever the impossible bug presents itself.
The Five Whys in Action
Take, for instance, this simple real-world scenario we were recently faced with: On a certain website, there is an “office locator” page. There, an interface allows the user to first select a state, then a city, and finally see locations as markers on a map. Recently, the client reported that suddenly several states, when selected, were not triggering the list of cities to display. Nothing related to this functionality had changed. There were no issues on our staging/QA site, and in fact, when we checked there, everything was still working as expected. What could be going on?
Starting on the page in question, we quickly saw that a couple of the states, when selected, were throwing a JavaScript error. In fact, these states were the same ones that failed to show their respective list of cities. Great! There’s the issue.
Why, though, did this just start happening, and to only a handful of states? Looking at the JavaScript error, we saw that the console had logged <Object> has no method ‘sort’. Hmm, cities should be an array, not an object. Why are we getting that error?
Looking at the debug console, we saw that some states were giving us an <object> of cities, while most were (correctly) giving us an array. So, of course, calling sort on it wouldn’t work. Great!
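To make that symptom concrete, here is a minimal sketch of the two shapes and what happens when each one is sorted. The payloads and the helper names describeCities and trySort are made up for illustration; they are not the site's real data or code.

```javascript
// Hypothetical payloads: most states gave us an array of cities,
// a few gave us a plain object instead.
const arrayPayload = ["Salem", "Portland", "Eugene"];
const objectPayload = { 0: "Salem", 2: "Portland", 3: "Eugene" };

// Report the shape of an already-parsed cities payload.
function describeCities(cities) {
  if (Array.isArray(cities)) return "array";
  if (cities !== null && typeof cities === "object") return "object";
  return typeof cities;
}

// Attempt the same sort the page performed and report the outcome
// instead of letting the exception escape.
function trySort(cities) {
  try {
    return { ok: true, value: cities.sort() };
  } catch (err) {
    // A plain object has no sort method, so the call throws a TypeError.
    return { ok: false, error: err.name };
  }
}

console.log(describeCities(arrayPayload), trySort(arrayPayload));
console.log(describeCities(objectPayload), trySort(objectPayload));
```

Checking the shape with Array.isArray, rather than trusting that the data "looks like a list" in the console, is what separates the states that work from the ones that don't.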
Adding extra code to convert possible objects to arrays here could have been the end of it. That still didn’t answer the question, though, of why this had just started happening, in fringe cases, on one instance of the site and not another. Tracking down the cities data, we saw that it was being loaded via an Ajax call each time a state was selected. In fact, we were simply using jQuery’s built-in $.getJSON() function and parsing the results if the call was successful. A core jQuery function like this should behave more consistently. So, maybe the issue was with the JSON data itself that we were receiving?
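For reference, the stopgap we decided against, converting objects to arrays in the browser, would have looked roughly like this. It is only a sketch: toCityArray is a hypothetical helper, and the sample values are invented.

```javascript
// Symptom-level patch: accept either shape and always hand back an array.
// Hypothetical helper, not the code that actually shipped.
function toCityArray(cities) {
  if (Array.isArray(cities)) {
    return cities.slice(); // copy so callers can sort without side effects
  }
  if (cities !== null && typeof cities === "object") {
    return Object.values(cities); // drop the keys, keep the city names
  }
  return []; // anything else is treated as "no cities"
}

// Both shapes can now be sorted safely.
const fromArray = toCityArray(["Salem", "Eugene"]).sort();
const fromObject = toCityArray({ 0: "Salem", 2: "Eugene" }).sort();
console.log(fromArray, fromObject);
```

It makes the error go away, but it says nothing about why the shape changed in the first place, which is exactly the question still open at this point.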
Looking at the script that provides the JSON data, at first everything seemed fine. Comparing the cities data of a state that worked against one that didn’t yielded nothing: they looked identical. The JSON being output seemed valid. Why then was $.getJSON() parsing one as an object, and another as an array? Maybe if we start removing values one at a time…aha! One of these cities has a comma in it! Could that be messing things up? But that only accounts for one of the states. Hold on, this other state has duplicate values, because of incorrect capitalization. What if we add another normalization routine to remove these?
This, not surprisingly, fixed the problem. jQuery’s $.getJSON() function was choking on certain characters in the city names and, in those cases, handing us the JSON data as an object rather than an array. It wasn’t until certain city names were added to the database that the problem presented itself.
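The actual routine isn't reproduced here, so treat the following as a minimal sketch of the idea rather than what we deployed. It assumes the cleanup amounts to trimming whitespace and dropping duplicates that differ only in capitalization; normalizeCityNames and the sample input are hypothetical, and it is written in JavaScript only to match the rest of the examples.

```javascript
// Clean a list of city names before it is serialized for the page:
// trim whitespace, drop blanks, and remove duplicates that differ
// only by capitalization (the first spelling seen wins).
function normalizeCityNames(names) {
  const seen = new Set();
  const result = [];
  for (const raw of names) {
    const name = String(raw).trim();
    const key = name.toLowerCase();
    if (name === "" || seen.has(key)) continue;
    seen.add(key);
    result.push(name);
  }
  // Return a plain, densely indexed array so it serializes as a JSON array.
  return result.sort((a, b) => a.localeCompare(b));
}

const cleaned = normalizeCityNames([" Salem", "Portland", "portland", "Eugene", "PORTLAND "]);
console.log(JSON.stringify(cleaned));
```

Because the cleanup happens where the data is produced, every consumer of the feed gets the same consistent list, not just this one page.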
Now, some of the more JavaScript-savvy readers might have seen this solution coming from the outset. The fact is, though, that this kind of detective work can be difficult, particularly when external parties are anxious to fix the bug, like, yesterday.
As this simple example shows, fixing the obvious issue isn’t always the same as addressing the root problem — especially when that problem is many steps removed from the actual “bug” that can be observed. In this case, a seemingly random issue was really the result of the client entering a city in a slightly different format. Correcting for the incorrect data structure at the JavaScript level might well have left the door open for other issues down the road, if that data feed were consumed elsewhere.
Justin recently wrote about forcing yourself to put your ego and initial reactions aside, and instead treat situations with the consideration and respect they deserve. Tracking down bugs is no different. Constantly asking “why” can seem counter-productive to finding the first fix, fast. Yet there, I think, is where the art of fixing bugs begins.