We just had some customers report a bug. Not good. We didn’t get an exception email. All the tests passed. We couldn’t see anything untoward in the log files. But it was there. We could reproduce it, both in staging and in production. Not good at all.
But the weirdest thing was we couldn’t figure out the cause. Well I could see why the code was failing (after adding some extra log messages). But ‘git blame’ said those lines of code were unchanged in twelve months. Why hadn’t people complained before? Why hadn’t we noticed it?
After much hunting through log files we found the point when the feature last worked. It coincided with a deployment. That deployment was our Rails 2.3.4 forms vulnerability fix. And the bug was in a form – a missing form parameter that earlier versions of Rails ignored, the newer Rails was choking on.
But why didn’t the tests catch it?
After more hunting I saw that the Cucumber test that exercised the form didn’t have a “When I press the Update button” step. And the subsequent tests were passing, even though the update button hadn’t been pressed.
So I added the step in and made the feature pass. Then deployed it as an emergency fix.
However, what are the lessons to learn here (as there are always some)?
- Firstly, testing cannot catch everything.
- Secondly, the cracks in your tests are where the bugs are.
- Thirdly, we probably need some sort of peer review for tests. I feel that this is more important than for code, because once the tests are right you can refactor the code without worry.
- Fourthly, you really need to log everything. Absolutely everything. Don’t worry about your huge log files – that’s what `logrotate` is for. Get it written down so that one time when you have an obscure bug, you’ll be able to find it easily.