In the past month, I fixed two multithreading bugs. Their symptoms were:
a) The first bug: some threads deadlocked. It occurred only on a few production machines, not very often, and never in the testbed.
b) The second bug: the program crashed after running for 3 ~ 5 hours because it entered a should-never-enter code path that triggers an assert. Although there was a core dump file, I couldn’t find any clues at the crime scene.
The straightforward way to debug the first bug was to check that every lock operation is paired with an unlock on every code path. Unfortunately, that was not the root cause, so I began to check all the code related to the lock. After two days, I finally found a copy-paste error that could open a can of worms.
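To make the "paired on every path" check concrete, here is a minimal C++ sketch of what I mean. It is not the code from the real bug (pairing turned out not to be the cause there); `Counter`, `add_manual`, `add_guarded` and `run_workers` are hypothetical names made up for illustration. The manual version only stays correct as long as nobody forgets the unlock on the early return, while the guarded version releases the mutex automatically on every path.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state, for illustration only.
struct Counter {
    std::mutex mu;
    int value = 0;
    int limit = 1000;
};

// Error-prone shape: lock() and unlock() must be paired by hand on every path.
// If the unlock before the early return were missing, the next caller
// would block forever on c.mu.
bool add_manual(Counter& c, int n) {
    c.mu.lock();
    if (c.value + n > c.limit) {
        c.mu.unlock();  // easy to forget on an early-return path
        return false;
    }
    c.value += n;
    c.mu.unlock();
    return true;
}

// Safer shape: std::lock_guard unlocks in its destructor, so every
// return path releases the mutex.
bool add_guarded(Counter& c, int n) {
    std::lock_guard<std::mutex> guard(c.mu);
    if (c.value + n > c.limit) return false;
    c.value += n;
    return true;
}

// Starts `threads` workers that each call add_guarded(c, 1) `per_thread`
// times, joins them, and returns the final counter value.
int run_workers(int threads, int per_thread) {
    Counter c;
    c.limit = threads * per_thread;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([&c, per_thread] {
            for (int j = 0; j < per_thread; ++j) add_guarded(c, 1);
        });
    }
    for (auto& t : pool) t.join();
    return c.value;
}
```

When reviewing for this kind of problem, I look at every `return`, `break`, `goto` and thrown exception between a lock and its unlock. With a scope guard such as `std::lock_guard`, that part of the review mostly disappears.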
For the second bug, I went through all the code in which multiple threads access the problematic variable, line by line, to see whether there was a corner case that could cause contention. Thank god! While I was taking a rest at noon, the idea finally came to me!
As you can see, while debugging these two bugs I couldn’t find any better method than reading the code again and again (I did try adding more traces, but it didn’t help). BTW, what the two bugs have in common is that the fix was simple: just modifying one line of code.