Python, NumPy and a Programmer Error: Story of a Bizarre Bug

While working on a performance analysis of Paxos-style protocols recently, I uncovered some weird quirks of Python and NumPy. Ultimately, the problem was in my code, but the symptoms looked extremely bizarre at first.

Modeling WPaxos required a series of computations with NumPy arrays. Normally, in each step I would initialize a new array and set its values by doing some calculations on the data from previous steps. In one step, however, I used a newly initialized array as the starting point for some additions with another NumPy array. By mistake, I initialized that array with numpy.empty() instead of numpy.zeros(), so it could contain garbage values capable of screwing up the entire computation. Naturally, I did not know I had made this mistake.
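To make the difference concrete, here is a minimal sketch of this kind of mistake. It is not the actual WPaxos model code: the accumulate helper and the sample arrays are made up for illustration, and the example assumes Python 3 with NumPy installed.

```python
import numpy as np

def accumulate(steps, init):
    """Sum the arrays in `steps` into a buffer created by `init` (hypothetical helper)."""
    total = init(steps[0].shape)  # np.zeros fills with 0.0; np.empty leaves memory as-is
    for step in steps:
        total += step             # adds on top of whatever the buffer already holds
    return total

# Made-up sample data standing in for results of earlier model steps.
steps = [np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0, 30.0])]

safe = accumulate(steps, np.zeros)   # well defined: starts from zeros
risky = accumulate(steps, np.empty)  # undefined start: may or may not match `safe`

print(safe)
print(risky)
```

The second result is the dangerous one: it often looks identical to the first, because freshly obtained memory frequently happens to be zeroed, but nothing guarantees that.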

Most of the time, though, this new array had all of its values set to zero, so I consistently observed correct results. That is, until I added a simple print statement (something like print some_array) to check an intermediate computation in the model. Printing the NumPy array caused the entire calculation to go wrong, leaving me with a big mystery: how could a simple Python print statement, which should have no side effects, change the results of subsequent computations?

I won’t lie, I was mesmerized by this for hours: I remove the print statement and everything works; I add it back and the entire model breaks. Consistently. Even after a reboot. Even weirder, the bad results were always the same, reproducible run after run after run.

In these consistent failures, I noticed a pattern. One computation step was always skewed by the same amount: the value of the array I was printing, as if that array had been added to the newly initialized array in the skewed step. That is when I noticed that I had used numpy.empty() instead of numpy.zeros(). After a simple fix, the outcome was the same regardless of whether I printed the results of the intermediate steps.

In the end, it was a programmer’s error, but the bizarre symptoms kept me away from the solution for way too long. I am not an expert on the internals of Python and NumPy, but I do have some idea of what might have happened. I think the print statement created some kind of temporary array, and this temporary array was destroyed after the print. (Alternatively, something else created a temporary array, and the print statement just shifted the addresses of memory allocations.) The next computation then allocated space for a new array of the same dimensions in the exact spot the old temporary had occupied. That newly created array therefore started with garbage values: the outputs of the previous step.
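The memory-reuse guess above can be probed with a small experiment. This is only a sketch under assumptions: count_leftover_values is a hypothetical helper, the array size is arbitrary, and whether the freed buffer is actually handed back depends on the allocator, the platform, and the NumPy version, so the printed count can legitimately be anything from zero to the full array length.

```python
import numpy as np

def count_leftover_values(n=1000):
    """Count elements of a fresh np.empty array that match a just-freed array.

    Hypothetical helper for illustration; the result is allocator dependent.
    """
    temp = np.arange(n, dtype=np.float64) + 1.0  # stand-in for a temporary array
    expected = temp.copy()                       # keep a reference copy to compare against
    del temp                                     # drop the only reference, freeing its buffer
    fresh = np.empty(n, dtype=np.float64)        # same size and dtype, left uninitialized
    return int(np.count_nonzero(fresh == expected))

matches = count_leftover_values()
print(matches, "of 1000 values look like leftovers from the freed array")
```

A high count would be consistent with the new array landing on the memory of the old temporary; a count of zero does not disprove the idea, it only means the allocator chose a different block this time.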

The interesting part is how consistent this was over hundreds of tries, producing the same failed output every time. How consistent did the memory allocations have to be in every run? And of course, many people may not even consider the possibility of such dirty-memory problems in a language like Python. NumPy, however, is written in C, and it clearly brings some of C’s quirks to Python with it, so read the documentation.