Reproducibility and Java Collections

Written by: Paul Rubin

Primary Source: OR in an OB World

I’ve been immersed in Java coding for a research project, and I keep tripping over unintentional randomness in the execution of my code. Coincidentally, I happened to read a blog post today titled “Some myths of reproducible computational research“, by C. Titus Brown, a faculty member (one of those hybrid species: Computer Science and Biology) atMichigan State University, my former employer. The post is worth reading. I’ve seen quite an up-tick in online discussions of the importance of reproducible research, sharing code, sharing data etc., to which I will add one more virtue (specific to computational research): it’s a darn site easier to debug a program whose behavior is deterministic compared to debugging an inherently stochastic program.

That brings me to a couple of tripwires that have been snagging me on the recent project (and some other projects, for that matter). First, any code that uses parallel threads is capable of injecting some unwanted randomness. A given thread will consume different amounts of time in different runs, due to uncontrollable (and hence, for our purposes, “random”) interference by external processes that from time to time are swapped into the CPU core running the given thread. If your program has multiple interdependent threads, the timing of when one thread gets a message from another thread will be different on each run unless you do some serious (and IMHO seriously tedious) magic to synchronize them. I generally shy away from coding multiple threads myself, the exception being programs with a graphical user interface, where the GUI has its own set of scheduling/event threads, and really lengthy computations need to be done in “worker” threads to avoid paralyzing the GUI. Even then, when I’m using a multithreaded library such as CPLEX, execution times vary in unpredictable ways.

The second tripwire has to do with Java collections. As a mathematician, I tend to think in terms of matrices, vectors and sets, and not so much in terms of lists and maps. I’ll often code a collection of objects as a HashSet, both because I think of them as a (mathematical) set rather than a list, and because the Set interface in Java has one key property not shared by the List interface: adding an object to a Set that already contains it does not create a second instance of the object. Unfortunately, a key shortcoming (IMHO) of HashSet is clearly spelled out in its documentation:

It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time.

To give one example of what this implies for reproducibility (and debugging), I coded a heuristic for constructing a spanning tree that takes eac node from a set (as in HashSet) of unattached nodes and links it to randomly selected node from the tree under construction. The selection of the node already in the tree to which the new node will be linked (by an edge) is done using a pseudorandom number stream with an explicit, user-specified seed. By repeating the seed, I could reproduce the choices of those nodes, if only the rest of the heuristic were deterministic. Unfortunately, because iterating over a HashSet is random in irreproducible ways, I get a different tree each time I run the code, even with the same seed value. So I need to change that HashSet to a list of some sort … and I need to remember this lesson the next time I’m tempted to use a HashSet.

The following two tabs change content below.
I'm an apostate mathematician, retired from a business school after 33 years of teaching mostly (but not exclusively) quantitative methods courses. My academic interests lie in operations research. I also study Tae Kwon Do a bit on the side.

Latest posts by Paul Rubin (see all)