Citing our software and algorithms – a trial

Written by: C. Titus Brown

Primary Source: Living in an Ivory Basement

Our lab is part of the ongoing online conversation about how to properly credit software and algorithms; as is my inclination, we’re Just Trying Stuff (TM) to see what works. Here’s an update on our latest efforts!

A while back (with release 1.0 of khmer) we added a CITATION file to the khmer repository and distribution. This was in response to Robin Wilson’s call for CITATION files, and dovetails with some of the efforts of the R and Debian communities.

In the short term, we don’t expect many people to pay attention to these kinds of infrastructure efforts, and for much of our work we actually have publications on the algorithms involved. More to the point, our software isn’t just software — it’s the instantiation of novel data structures and algorithms, or at least novel applications of data structures. The people who did the research are not necessarily the same people as the developers and maintainers of our software implementation, and we’d like to reward both appropriately with citations.

Additionally, for things like tenure, promotion, and grants, it is often the case that only peer-reviewed articles count. Given that, having citations accrue to those articles is a good idea!

So, rather than directly citing our tarballs or repositories (see F1000 Research and Mozilla Science Lab’s efforts) we have modified our scripts to output the appropriate citation information. For example, if you run ‘normalize-by-median.py’, you get this output:

|| This is the script 'normalize-by-median.py' in khmer.
|| You are running khmer version 1.1-9-g237d5ad
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
||
|| Please see the CITATION file for details.

The first citation is the software description, and the second is the normalization algorithm.

Likewise, if you run ‘load-graph.py’, you will see:

|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * J Pell et al., PNAS, 2014 (PMID 22847406)
||
|| Please see the CITATION file for details.

Here the second citation is our De Bruijn graph paper.
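
For anyone who wants to try something similar in their own scripts, here is a minimal sketch of how per-script citation output like the above could be produced. To be clear, this is not khmer's actual code: the names `SOFTWARE_CITATION`, `ALGORITHM_CITATIONS`, `SCRIPT_ALGORITHMS`, `citation_banner` and `print_banner` are all made up for illustration, and writing to stderr is my assumption (it keeps the banner out of any data on stdout). The citation strings themselves are copied verbatim from the output above.

```python
import sys

# All names below are hypothetical; this is not khmer's actual implementation.
SOFTWARE_CITATION = "MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190"

# Citation strings copied verbatim from the script output shown above.
ALGORITHM_CITATIONS = {
    "diginorm": "CT Brown et al., arXiv:1203.4802 [q-bio.GN]",
    "graph": "J Pell et al., PNAS, 2014 (PMID 22847406)",
}

# Which algorithm papers each script depends on (illustrative mapping).
SCRIPT_ALGORITHMS = {
    "normalize-by-median.py": ["diginorm"],
    "load-graph.py": ["graph"],
}


def citation_banner(script_name):
    """Return the citation banner for script_name as a list of lines."""
    body = [
        "If you use this script in a publication, please cite EACH of the following:",
        "",
        "  * " + SOFTWARE_CITATION,
    ]
    # Add one entry per algorithm paper; unknown scripts get the software only.
    for key in SCRIPT_ALGORITHMS.get(script_name, []):
        body.append("  * " + ALGORITHM_CITATIONS[key])
    body += ["", "Please see the CITATION file for details."]
    # Prefix every line with "||", matching the output format shown above.
    return [("|| " + line) if line else "||" for line in body]


def print_banner(script_name, stream=None):
    """Write the banner to stream (stderr by default, an assumption here)."""
    stream = stream if stream is not None else sys.stderr
    for line in citation_banner(script_name):
        print(line, file=stream)


if __name__ == "__main__":
    print_banner("load-graph.py")
```

Keeping the software citation separate from the algorithm citations is the point of the design: each script always credits the implementation, and then adds whichever research papers it actually relies on.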

Interestingly, GNU Parallel also provides citation information:

When using programs that use GNU Parallel to process data for publication please cite:

 O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
 ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.

which is pretty cool!

Note also that Michael Crusoe, who works with me on khmer (side note: find Michael some completely over-the-top title – “the khmer software maestro”?), has been working with the Galaxy folk to build citation infrastructure into the Galaxy Tool Shed. Mo’ infrastructure mo’ bettah.


What’s next?

Now that we’re starting to provide unambiguous and explicit citation information, it’s up to individual actors to cite our software appropriately. That’s something we can help with (by mentioning it in e.g. reviews) but I’m not sure how much more we can do in the khmer project specifically. (Suggestions welcome here!)

My biggest unanswered concern in this space is now something completely different: it’s providing (and getting) credit for the CS research. For example, there are several implementations of the digital normalization idea — in silico normalization (in Trinity) and also BBnorm (see here and here). Those are implementations of the underlying idea of normalization, and I (perhaps selfishly) think that in most cases people using the BBnorm or Trinity code should be citing our digital normalization preprint.

This concern emerges from the fact that good algorithm development is largely different from good software development. Many bioinformaticians provide basic proof of concept for an algorithm or a data structure, but do not invest much time and energy in software engineering and community engagement. While this is a great service — we often do need new algorithms and data structures — we also need good implementations of data structures and algorithms. Academia tends to reward the DS&A and not the implementation folk, but I think we need to do both, and need to do both separately. Shifting to a system where only the implementers get credit doesn’t seem like a great improvement to me ;).

So my thought here is that any tool that uses a research algorithm or data structure developed by others should output citation information for that other work. This follows the advice given by Sarah Callaghan to “cite what you use”.

A specific example we’re planning: someone is porting some abandoned thesisware to khmer. The citation information will specify both khmer (for the software implementation) and the methods publication (already published) for basic validation information.

I’m not sure where to draw the line, though. There are clearly cases where the data structures and algorithms have been developed further and our work no longer deserves to be cited in the software, and other cases where the DS&A work may be irrelevant. People using Minia, for example, should not need to cite Pell et al., 2012, because the Minia folk extended our work significantly in their paper and implementation. And, for many instances of k-mer counting, our low-memory k-mer counting work (soon to be published in Zhang et al., 2014) is not necessary; if people are using khmer because of its convenient Python API, but not depending in any particular way on low-memory k-mer counting, they probably shouldn’t bother citing Zhang et al. Or maybe they should, because of accuracy concerns addressed in that paper?
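
If the line were drawn at "cite only what the run actually depended on", one way to sketch that policy is below. This is purely illustrative and not something khmer does: `FEATURE_CITATIONS`, the feature names, and `citations_for_run` are hypothetical, and the citation strings are just the ones mentioned in this post.

```python
# Hypothetical sketch of one possible policy; names are illustrative only.
SOFTWARE_CITATION = "MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190"

# Feature name -> paper that feature depends on (strings as cited in this post).
FEATURE_CITATIONS = {
    "low_memory_kmer_counting": "Zhang et al., 2014",
    "digital_normalization": "CT Brown et al., arXiv:1203.4802 [q-bio.GN]",
}


def citations_for_run(features_used):
    """Return the software citation plus one citation per feature relied on."""
    citations = [SOFTWARE_CITATION]
    for feature in features_used:
        paper = FEATURE_CITATIONS.get(feature)
        # Skip unknown features and avoid listing the same paper twice.
        if paper is not None and paper not in citations:
            citations.append(paper)
    return citations
```

Under this policy, someone using khmer only for its Python API would pass an empty list and be asked to cite just the software. It does not settle the accuracy question raised above; that is a judgment call the code can't make for us.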

I’d be interested in your thoughts and experiences here. (Note, I haven’t sent anything to the Trinity or BBnorm folk because (a) I haven’t thought things through, (b) that’d probably be too aggressive anyway, and (c) we should figure out what the community norms should be first…)

–titus

p.s. Thanks to Michael Crusoe and Varsha Khodiyar for reading a preview of this blog post and giving me feedback!

C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering and the Department of Microbiology and Molecular Genetics. He earned his PhD ('06) in developmental molecular biology from the California Institute of Technology. Brown is director of the laboratory for Genomics, Evolution, and Development (GED) at Michigan State University. He is a member of the Python Software Foundation and an active contributor to the open source software community. His research interests include computational biology, bioinformatics, open source software development, and software engineering.