Other Thoughts

Python, Numpy and a Programmer Error: Story of a Bizarre Bug

While recently working on my performance analysis for Paxos-style protocols, I uncovered some weird quirks about python and numpy. Ultimately, the problem was with my code, however the symptoms of the issue looked extremely bizarre at first.

Modeling WPaxos required doing a series of computations with numpy. In each step, I used numpy to do some computations with arrays. Normally, I would initiate a new array and set the values by doing some calculations on the data from previous steps. However, in one step I used newly initialized array to perform some additions with another numpy array. Of course, by mistake I initialized a new array with numpy.empty() instead of numpy.zeroes(), causing the new array to potentially contain some garbage values that may screw up the entire computation. Obviously, I did not know I made this mistake.

However, most of the times, this new array had all values set to zero, so I consistently observed the correct results. That is until I added a simple print statement (something like print some_array) on some array to check on the intermediate computation in the model. Printing the numpy array caused the entire calculation to screw up, leaving me with a big mystery: how a simple python print statement, that should have no side-effects, change the results of subsequent computations?

I wouldn’t lie, I was mesmerized by this for hours: I remove the print statement and everything works, I add it back and the entire model breaks. Consistently. Even after a reboot. What is even more weird, the bad results I observed were consistently the same, reproducible run after run after run.

And in such consistent failures, I observed a pattern. One computation step was always skewed by the same value, the value of the array I was printing, as if that array was added to the newly initialized array in the skewed step. And this is when I noticed that I use numpy.empty() instead of numpy.zeroes(). A simple fix and the outcome was the same, regardless of whether I print results of the intermediate steps or not.

In the end, it was a programmer’s error, but the bizarre symptoms kept me away from the solution for way too long before uncovering the truth. I am not an expert on the internals of python and numpy, but I do have some clue as of what might have happened. I think, the print statement created some kind of temporary array, and this temporary array got destroyed after the print. (Alternatively, something else created a temporary array, and print statement just shifted the address of memory allocations). Next computation then allocated space for new array, having the same dimensions, in the exact same spot of the old temporary one. And this newly created array then had garbage values, containing the outputs of the previous step.

The interesting part is how consistent this was for hundreds of tries, producing the same failed output. How consistent the memory allocations had to be in every run? And of course, many may not even think about the possibility of having such dirty memory problems in languages such as python. Numpy, however, is written in C, and it clearly brings some of the C’s quirks to python with it, so read the documentation.

Retroscoping Zookeeper Staleness

ZooKeeper is a popular coordination service used as part of many large scale distributed systems. ZooKeeper provides a file-system inspired abstraction to the users on top of its replicated key-value store. Like other Paxos-inspired protocols, ZooKeeper is typically deployed on at least 3 nodes, and can tolerate F node failure for a cluster of size 2F+1. One characteristic of ZooKeeper is that it runs the consensus algorithm for update operations, however to speed things up reads may be served locally by the replica a client connects to. This means that a value read from a replica may be stale or outdated compared to the leader nodes or even other followers. This makes it a user’s problem to tolerate the stale data read from ZooKeeper, or perform sync operation to get up-to-date with the leader before reading.

But how stale can data become in a ZooKeeper cluster? This is a rather tricky question to answer, since each replica runs on a separate machine with different clock. We cannot just observe the time a value become available at each node, because these timestamp are not comparable due to clock skew.  Instead of trying to figure out the staleness time, I decided to look at how many versions behind can a replica be. I used my Retroscope tool to keep track of when the data becomes available to the client at each replica. I used Retroscope Query Language (RQL) to collect the data from nodes and look at the consistent cuts progressing through the states of ZooKeeper. Retroscoping ZooKeeper took only about 30 lines of code to be added to the project.

Initial run. 1 version staleness is expected.

I deployed a small ZooKeeper cluster on 3 AWS t2.micro instances (I know, it is far from production setup, but it works well for a quick test). On a separate instance I deployed RQL server. To start, I simply created one znode and updated its value. I then proceeded with running the simplest possible RQL query: SELECT retro FROM setd; in this query, retro corresponds to /retro znode I’ve created, and setd is the name of the Retroscope log I put my monitoring data into. The result of the query was exactly what I have expected: at one of the consistent cuts one of the nodes had a value one version behind. This is entirely normal behavior, as the value needs to propagate from the leader to the followers once decided.

My next move was to give a bit more work to the ZooKeeper in a short burst, so I quickly wrote a small program that puts some incremental values to n znodes for a total of r writes. It starts with a value tst1, then goes to tst2 and so on for every znode.  At first I restricted n=1, as I felt that writing to just one znode to create a “hot-spot” was going to give me the best chance of getting stale values. But ZooKeeper handled the burst of 100 writes easily, with the results being identical to a single write: stale value was at most 1 version behind the current data.

Two versions behind.

Seeing things work well was not interesting for me, so I decided to no longer play fair. I artificially made one replica to work slower and be a struggler node. ZooKeeper protocol tolerates this just fine, as it needs 2 out of 3 nodes in my cluster to form the majority quorum and make progress. For this crippled ZooKeeper setup I re-ran the workload and my simple query. Needless to say, I was able to spot one time a system had a node with a znode 2 versions behind, and it was my struggler machine. In the next step I increased the load on the cluster and made the workload write 1000 values to the same znode. I also changed the query so that I do not have to manually look through thousands of consistent cuts trying to spot the stale data. My new query emits consistent cuts with staleness of 2 or more versions:

SELECT r1 FROM setd WHEN Int(StrReplace(r1, “tst”, “”)) – Int(StrReplace(r1, “tst”, “”)) > 1;

The interesting thing about the query above is that it uses the same variable name twice in the expression, essentially telling RQL to output a cut when r1 – r1 > 1. However, there are are many r1‘s in the system (3 in fact, one at each node), so when a pair of r1‘s that satisfy WHEN condition is found (at different nodes), RQL will output the consistent cut.

Very stale.

I was a bit surprised by the results. At first the struggler managed to keep-up with the rest of the cluster quite well, slipping 2 version behind on occasion, but after about 200 requests things went out of control for the crippled node, with the staleness growing to be 158 version behind towards the end of the run. Of course a struggler node will make things look worse, but it is not an unrealistic scenario to have underperforming machines. My test however is not fair either, as I was using a 100% write workload targeting just one znode. So In the next try I changed the workload to target 100 different znodes, while still measuring the staleness on just one znode. In that experiment, the  staleness was not that high, but the number of updates to a single znode was only a small fraction of the previous test. Nevertheless, struggler replica was as much as 6 version stale, making it roughly 1/3rd of updates behind the rest of the cluster.

For the last quick test, I tried doing a large burst of 2000 writes to the same znode on a healthy cluster with no struggler nodes. Despite all replicas working at their proper speed, I was able to observe staleness of 3 versions on some occasions.

I am not sure about the lessons learned from this quick experiment. I was amazed by how easy it is to get ZooKeeper to have data 2 or more versions behind, although the system seems fast to catch up. Struggler scenario, however, illustrated how quickly things can get out of control with just some performance degradation at one of the replicas.  Engineers using ZooKeeper must build their applications in such a way to tolerate the stale ZooKeeper values gracefully.

Why Government IT is Expensive and Archaic

Disclaimer: I do not work for the government, and my rant below is based on my very limited exposure to how IT works at the US government setting.

Why Government IT is Expensive and Archaic? I think, this can be a very long discussion, but I do have a quick answer:  standards imposed by government and used by the government regulated industries. I have a very little experience with these types of standards, but they make me cringe every time I have to deal with them. Bellow I briefly describe my encounter with them.

When I was just a college student, I joined a (very) small IT company sitting next to the University campus. I started just as an intern during my 3rd year of college, and I was working fulltime a year later. At the same time, our team was tasked with making a piece of software for a private company to be used for tracking medical services provided to students at public schools throughout the state. This piece of software was to replace an older one, and we had a very strict set of requirements: “Make it work and look the same but better”.  Such requirements, along with many decisions a college-student-turned-software-engineer had to make, shaped how the system is working right now. Of course, it was not just me making the product, but nevertheless, my “brilliant” ideas slipped in and became what the system is today.

One part if the system is responsible for billing the medical services tracked in the system to Medicaid. And this is where the interactions with government IT has started. Government agency cannot just provide a secure API for the software developers like myself to use. No, it has to hire a major business to implement a standard commonly used for Medical transaction. And here we are, in the 21st having to adhere to the Electronic Data Interchange (EDI) X12, a standard designed in the 1970s to transfer medical (among other EDI uses) data between computer systems. Don’t get me wrong, standards are good, they make different systems work together flawlessly… But that is until you start looking at the standard. For the system, we had to implement only one transaction type at that moment, so what can be difficult about it? The difficulty started with a 700 page manual just for constructing the transaction request. The manual is accompanied by an errata and an errata of errata. In addition, there is a 100 page companion guide that specifies procedures specific for the state Medicaid.

So how easy is the standard itself? Maybe it is not that complicated to work with and the manuals are just full of fluff? Well, EDI is a textual format, so in theory a person can read the file to see all the data. Manuals even provide the example of how pieces of it should look like: SV1HC:99211:2512.25UN111✽✽1:2:3✽✽Y~

Easy enough to read? Nope, so you are back at the manual studying what everything means. For example, SV1 is the header for section describing the professional (medical) service provided, HC code describes what type of codes to follow next, 99211 is the service code followed by code modifiers and modifiers of modifiers and so on. Somewhere in there is how much to charge Medicaid and how many units of service have been provided. But on the bright side, it has cool delimiters: stars, colons and tildes. To top it off, each field can be flexible in size or can be restricted to some number of characters or to a certain set of values, and to find this you consult the manual once again.

What if you make an error? Not a problem, the response to the request comes back as an EDI file, equally cryptic, that describes what went wrong. And we are back to the manuals, counting stars and tildes to check if you send all the data and whether it was in the right format and right order.

It is also worth mentioning that EDI format is used for HIPAA protected medical data, but it has no security built in, everything is plaintext and with the help of manual anyone can read it.  The transmission of the requests, however, is carried out over a secure channel. In my case, I am sending batched requests for processing and the only way to transmit those is by using an SFTP. Needless to say it also becomes my responsibility to pull the response files from the SFTP server once requests are processed, and since there is no strict guarantees on when the processing is completed for each batch, I just do it periodically and eventually collect the responses. How the EDI files are stored on the system side is all up to the engineers designing such system, and as users we can only hope it has passed HIPAA compliance checks and secures the data at rest.

Obviously the archaic standards, like EDI, work well in practice. After all, despite my “brilliant” ideas implemented in other parts of the system, the EDI layer was relatively problem-free. But the standards definitely can be better and new standards, if developed, can provide improved security,  they can be easier to work with, and they will reduce the costs of developing new software. However, I suspect the costs of switching to new standards are just too high for the existing infrastructure, forcing the industry to create a bigger and bigger gap between what government agencies require and what is modern, efficient and secure.

I no longer work for the company, but I still maintain the software. And I was dreading the time a client asks for changes to how the system interacts with Medicaid. It seems like this time is upon me now, and in the summer I will be looking at more manuals for EDI X12 requests needed to implement new features.