Recently I fixed a memory corruption issue: in an 8-byte memory address, one byte in the middle had been set to 0, so the address became invalid, and accessing it crashed the program. This reminded me of another memory corruption issue that I had fixed before. In my experience, this kind of memory corruption is very difficult to debug: the adjacent memory is intact, and only one or a few bytes have been changed to other values. These bugs are not obvious out-of-bounds memory accesses, and it is hard to find a way to reproduce them.
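To make the single-byte corruption described above concrete, here is a minimal Python sketch that compares a suspect 8-byte value with a reference copy and reports which bytes differ. The addresses and the function names are hypothetical, and the sketch assumes that a trustworthy copy of the pointer exists somewhere else in the core dump (for example, in another structure or on the stack), which is not always the case.

```python
def diff_byte_positions(suspect: int, reference: int, width: int = 8) -> list:
    """Return the byte indices (0 = least significant) where the values differ."""
    s = suspect.to_bytes(width, "little")
    r = reference.to_bytes(width, "little")
    return [i for i in range(width) if s[i] != r[i]]


def zeroed_byte_positions(suspect: int, reference: int, width: int = 8) -> list:
    """Return the differing positions if every differing byte in `suspect` is 0.

    An empty list means the values are equal, or the difference is not a
    pure "bytes overwritten with zero" pattern.
    """
    s = suspect.to_bytes(width, "little")
    positions = diff_byte_positions(suspect, reference, width)
    if positions and all(s[i] == 0 for i in positions):
        return positions
    return []


if __name__ == "__main__":
    # Hypothetical values: the reference pointer and a copy with byte 3 zeroed.
    good = 0x00007F3A5C2D1E40
    bad = 0x00007F3A002D1E40
    print(zeroed_byte_positions(bad, good))
```

A result with exactly one position hints at a stray one-byte write, such as a misplaced string terminator or a flag being cleared through a stale pointer, rather than a large buffer overrun.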
Generally speaking, logs can’t always give you a hand when memory is randomly corrupted, not to mention that in some situations the traces won’t be provided at all, for various reasons. The only thing you get is the core dump file, and you must use it to unearth as much information as possible. For example, from the program’s perspective, what was the state of the program when it crashed? Apart from the corrupted memory, were there other abnormalities? From the system’s perspective, have you examined all the register values? Are they all valid? If not, which part of the code could have caused that?
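The register check mentioned above can be partly mechanized. The sketch below is only an illustration with made-up data: it assumes you have already copied the register values and the mapped address ranges out of the debugger by hand (for example, from gdb's `info registers` and, where it is available for core files, `info proc mappings`). The names `Mapping`, `find_mapping` and `unmapped_registers` are hypothetical helpers, not part of any tool.

```python
from typing import Dict, List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int  # first address of the range
    end: int    # one past the last address (exclusive)
    name: str   # label, e.g. "[heap]" or a library path


def find_mapping(addr: int, mappings: List[Mapping]) -> Optional[Mapping]:
    """Return the mapping that contains `addr`, or None if it is unmapped."""
    for m in mappings:
        if m.start <= addr < m.end:
            return m
    return None


def unmapped_registers(registers: Dict[str, int],
                       mappings: List[Mapping]) -> Dict[str, int]:
    """Return the registers whose value does not fall inside any mapping."""
    return {name: value for name, value in registers.items()
            if find_mapping(value, mappings) is None}


if __name__ == "__main__":
    # Hypothetical data, typed in by hand from a debugger session.
    mappings = [
        Mapping(0x00007F3A5C000000, 0x00007F3A5C400000, "[heap-like region]"),
        Mapping(0x00007FFD10000000, 0x00007FFD10021000, "[stack]"),
    ]
    registers = {
        "rsp": 0x00007FFD10020F00,
        "rdi": 0x00007F3A002D1E40,  # one byte zeroed: points nowhere
    }
    print(unmapped_registers(registers, mappings))
```

Pass in only the registers you expect to hold pointers at the crash site. A general-purpose register may legitimately contain a counter or a flag, so a hit here is a lead to follow, not proof of corruption.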
So whenever you meet a bug that is hard to reproduce, don’t freak out. Just calm down and analyze the core dump file carefully. You become a detective, and the core dump file is the crime scene. In reality, you can’t ask the criminal to commit the crime again. Similarly, not every bug can be made to recur; you must try your best to find the root cause from the core dump file. In my experience, every tough debugging session helps you understand the program and the system better, so it is a precious learning opportunity.
Treasure the core dump file and enjoy debugging!