# Ordering Index Vector with Java Streams

I bumped up against the following problem while doing some coding in Java 8 (and using streams where possible). Given a vector of objects $$x_1, \dots, x_N$$ that come from some domain having an ordering $$\le$$, find the vector of indices $$i_1, \dots, i_N$$ that sorts the original values into ascending order, i.e., such that …
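
The problem statement above can be illustrated with a minimal sketch. This is not the post's own solution (the excerpt is cut off before it gets there), only one plausible stream-based approach under stated assumptions: the class name `OrderingIndex` and method name `sortIndices` are hypothetical, and the indices are 0-based rather than the 1-based $$i_1, \dots, i_N$$ used in the text.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical helper class, named here for illustration only.
public class OrderingIndex {

    // Returns 0-based indices that arrange `values` in ascending order
    // under `order`. Sorting an ordered stream is stable, so ties keep
    // their original relative positions.
    public static <T> int[] sortIndices(List<T> values, Comparator<? super T> order) {
        return IntStream.range(0, values.size())
                .boxed()                                   // Stream<Integer> of positions
                .sorted((a, b) -> order.compare(values.get(a), values.get(b)))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        List<String> x = Arrays.asList("pear", "apple", "fig");
        int[] idx = sortIndices(x, Comparator.naturalOrder());
        System.out.println(Arrays.toString(idx)); // prints [1, 2, 0]
    }
}
```

The sketch sorts the positions rather than the values, comparing positions by the values they point to, so the original list is left untouched.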

# Creating A New MIME Type

I struggled a bit this afternoon creating a new MIME type and associating it with a particular application, so I’m going to archive the solution here for future reference. This was on a Linux Mint system, but I found the key information in a GNOME documentation page, so I suspect it works for Ubuntu and …

# Alternative Versions of R

Fair warning: most of this post is specific to Linux users, and in fact to users of Debian-based distributions (e.g., Debian, Ubuntu or Mint). The first section, however, may be of interest to R users on any platform. An alternative to “official” R By “official” R, I mean the version of R issued by the …

# The Monty Hall Evolver

The Monty Hall problem is very famous (Wikipedia, NYT). It is so famous because it so easily fools almost everyone the first time they hear about it, including people with doctorate degrees in various STEM fields. There are three doors. Behind one is a big prize, a car, and behind the two others are goats. …
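
For readers who want to check the counterintuitive answer numerically, here is a small simulation sketch. It assumes the standard rules, which the excerpt above is cut off before stating: after the player's first pick, the host always opens one of the other doors to reveal a goat and offers a switch. The class name `MontyHall` and the method `winRate` are hypothetical, and this is not the "evolver" the post goes on to describe.

```java
import java.util.Random;

// Hypothetical simulation class, named here for illustration only.
public class MontyHall {

    // Fraction of games won over `trials` rounds when the player always
    // switches (switchDoor = true) or always stays (switchDoor = false).
    public static double winRate(int trials, boolean switchDoor, long seed) {
        Random rng = new Random(seed);
        int wins = 0;
        for (int t = 0; t < trials; t++) {
            int car = rng.nextInt(3);   // door hiding the car
            int pick = rng.nextInt(3);  // player's first choice
            // Assumed rule: the host opens a goat door other than the pick.
            // Switching therefore wins exactly when the first pick was wrong.
            boolean win = switchDoor ? (pick != car) : (pick == car);
            if (win) {
                wins++;
            }
        }
        return (double) wins / trials;
    }

    public static void main(String[] args) {
        System.out.println("switch: " + winRate(100_000, true, 42L));
        System.out.println("stay:   " + winRate(100_000, false, 42L));
    }
}
```

Under these assumptions the switching strategy should win close to two thirds of the time and the staying strategy close to one third.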

# Updated Java Utilities for CPLEX and CP Optimizer

I just finished adding a feature to a utility library I use in Java projects that employ either CPLEX or CP Optimizer. In addition, I moved the files to a new home. The library is free to use under the Eclipse Public License 1.0. The code is mentioned in previous posts, so I’ll just quickly …

# An SSH Glitch

Something weird happened with SSH today, and I’m documenting it here in case it happens again. I was minding my own business, doing some coding, on a project that is under version control using Git. After committing some changes, I was ready to push them up to the remote (a GitLab server here at Michigan …

# Coding for kids

I’ve been trying to get my kids interested in coding. I found this nice game called Lightbot, in which one writes simple programs that control the discrete movements of a bot. It’s very intuitive and in just one morning my kids learned quite a bit about the idea of an algorithm and the notion of …

# Thunar Slow-down Fixed

My laptop is not exactly a screamer, but it’s adequate for my purposes. I run Linux Mint 17 on it (Xfce desktop), which uses Thunar as its file manager. Not too long ago, I installed the RabbitVCS version control tools, including several plugins for Thunar needed to integrate the two. Lately, Thunar has been incredibly …

# Parsing Months in R

As part of a recent analytics project, I needed to convert strings containing (English) names of months to the corresponding cardinal values (1 for January, …, 12 for December). The strings came from a CSV file, and were translated by R to a factor when the file was read. The factor had more than 12 …

# Python usage survey 2014

Remember that Python usage survey that went around the interwebs late last year? Well, the results are finally out and I’ve visualized them below for your perusal. This survey has been running for two years now (2013-2014), so where we have data for both years, I’ve charted the results so we can see the changes …

# RStudio Git Support

One of the assignments in the R Programming MOOC (offered by Johns Hopkins University on Coursera) requires the student to set up and utilize a (free) Git version control repository on GitHub. I use Git (on other sites) for other things, so I thought this would be no big deal. I created an account on …

# Estimate whether your sequencing has saturated your sample to a given coverage

This recipe provides a time-efficient way to determine whether you’ve saturated your sequencing depth, i.e. how much new information is likely to arrive with your next set of sequencing reads. It does so by using digital normalization to generate a “collector’s curve” of information collection. Uses for this recipe include evaluating whether or not you …

# Estimating (meta)genome size from shotgun data

This is a recipe that provides a time- and memory-efficient way to loosely estimate the likely size of your assembled genome or metagenome from the raw reads alone. It does so by using digital normalization to assess the size of the coverage-saturated de Bruijn assembly graph given the reads provided by you. It does …

# Downsampling shotgun reads to a given average coverage (assembly-free)

Below is a recipe for subsetting a high-coverage data set to a given average coverage. This differs from digital normalization because the relative abundances of reads should be maintained; what changes is the average coverage across all the reads. Uses for this recipe include subsampling reads from a super-high coverage data set for …

# Extracting shotgun reads based on coverage in the data set (assembly-free)

In recent days, we’ve gotten several requests, including two or three on the khmer mailing list, for ways to extract shotgun reads based on their coverage with respect to the reference. This is fairly easy if you have an assembled genome, but what if you want to avoid doing an assembly? khmer can do this …

# Updated Benders Example

Two years ago, I posted an example of how to implement Benders decomposition in CPLEX using the Java API. At the time, I believe the current version of CPLEX was 12.4; as of this writing, it is 12.6.0.1. Around version 12.5, IBM refactored the Java API for CPLEX and, in the process, made one or …

# Replication and quality in science – another interview

Nik Sultana, a postdoc in Cambridge, asked me some questions via e-mail, and I asked him if it would be OK for me to publish them on my blog. He said yes, so here you go! How is the quality of scientific software measured? Is there a “bug index”, where software loses points if it’s …

# How to make beautiful data visualizations in Python with matplotlib

It’s been well over a year since I wrote my last tutorial, so I figure I’m overdue. This time, I’m going to focus on how you can make beautiful data visualizations in Python with matplotlib. There are already tons of tutorials on how to make basic plots in matplotlib. There’s even a huge example plot …

# Turning Bounds into Constraints in CPLEX

I had to delve into the CPLEX documentation today, and found something I had not seen before. As part of a (Java) program I’m writing, I need to use the conflict refiner to track down which upper and lower bounds on variables play a role in making a linear program infeasible. Of course, I could change the …

# Being a release manager for khmer

We just released khmer v1.1, a minor version update from khmer v1.0.1 (minor version update: 220 commits, 370 files changed). Cancel that: _I_ just released khmer, because I’m the release manager for v1.1! As part of an effort to find holes in our documentation, “surface” any problematic assumptions we’re making, and generally increase the bus factor of the khmer project, …

# Learning R

I have recently dedicated myself to learning R, a programming language and environment focused largely on statistical analysis and computing. The benefit of using R over other statistical computing packages is that it is free, open-source, and has a hugely active community around its use. R can be used cross-platform (PCs, Macs, and Linux) …

# A khmer mini-Hackathon: Introducing scientists to testing and code review

As part of the 2-day Mozilla Science Labs hackathon in late July, the khmer project will be providing a “mentored open source contributathon” experience. This will provide an opportunity for people interested in trying out our instance of the “github flow” model, in which contributions are submitted for review using a pull request. Since our …

# Trying (and failing?) to build a Scalable CountMin Sketch

tl;dr? I played around with building a CountMin Sketch that is dynamic in size, based on a scalable Bloom Filter approach. I’m not sure it worked. Thoughts, suggestions, help? Bloom Filters In our research, we’ve made some hay using Bloom filters. They’re remarkably easy to implement; I’ve talked about them a couple of times on …

# Some thoughts on research coding and Stupidity Driven Development

I’m on a European trip that involves several plane flights accompanied by long airport stays, and I just used some of that time to do a bit of tedious coding on khmer. The coding I did was to add proper exception handling to khmer’s internal file loading routines (see the pull request). The old behavior …

# Programming Language Breakdown for the HealthCare.gov Website

Late last year, the NY Times released an article quoting a specialist working on the HealthCare.gov web site: According to one specialist, the Web site contains about 500 million lines of software code. By comparison, a large bank’s computer system is typically about one-fifth that size. This astronomically large number became the subject of intense …

# A Java Slider/Text Combo

A few years back I was coding (in Java, of course) the <shudder>GUI</shudder> for a research program. I needed to provide controls that would let a user specify priorities (on a 0-100 scale) for various things. Two possibilities occurred to me, with pretty much diametrically opposed strengths and weaknesses. Sliders have a few virtues. Grabbing and yanking …

# Setting CPLEX Parameters in Java Revisited

A bit more than a year and a half ago, I wrote some Java code to facilitate setting parameters for the CPLEX optimizer using their Concert API. Since then, I’ve added support for their CP Optimizer, and IBM has refactored the handling of parameters in CPLEX, necessitating an update to my code. This post (which …

# CP Optimizer, Java and NetBeans

After years of coding CPLEX applications in Java, I’ve just started working with CP Optimizer (the IBM/ILOG constraint programming solver) … and it did not take me long to run into problems. As with CPLEX, you access CP Optimizer from Java through the Concert API. As always, I am using the NetBeans IDE to do …

