Sunday 6 December 2015

Open code: not just data and publications


I had a nice exchange on Twitter this week.

Nick Fay had found a tweet I had posted over a year ago, asking for advice on an arcane aspect of statistical analysis:


I’d had some replies, but they hadn’t really helped. In the end, I’d worked out that there was an error in the stats bible written by Rand Wilcox, which was leading me astray. Once I’d overcome that, I managed to get the analysis to work.

It was clear Nick was now having the same problem and going round exactly the same circles that I had.


My initial thought was that I could probably dig out the analysis and reconstruct what I’d done, but my heart sank at the prospect. But then I had a cheerful thought: I had deposited the analysis scripts for my project on the Open Science Framework, here. I checked, and the script was pretty well annotated, and as a bonus there was another script showing how to make a nice swarm plot.
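A swarm plot, for anyone unfamiliar, spreads overlapping data points sideways so that every observation stays visible. Here is a minimal sketch of the idea in R – not the deposited script – assuming the beeswarm package is installed and using simulated data:

```r
# Minimal swarm plot sketch; assumes the 'beeswarm' package is installed,
# and the data below are simulated purely for illustration
library(beeswarm)

set.seed(1)
dat <- data.frame(
  score = c(rnorm(30, mean = 100, sd = 15), rnorm(30, mean = 110, sd = 15)),
  group = rep(c("A", "B"), each = 30)
)

# beeswarm() nudges overlapping points apart so each one remains visible
beeswarm(score ~ group, data = dat,
         pch = 16, col = c("steelblue", "tomato"),
         xlab = "Group", ylab = "Score")
```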

This experience comes hard on the heels of another interaction, this time around a paper I’m writing with Paul Thompson on p-curve analysis (latest preprint is here). In that project there are no raw data, just simulations, and it’s been refreshing to interact with reviewers who not only look at the code you have deposited but also make their own code available. There’ve been disagreements with the reviewers about aspects of our paper, and it helped enormously that we could examine one another’s code. The nice thing is that when code is available, you get to really understand what someone has done, and you also learn a great deal about coding.
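To give a flavour of what such simulations look like – this is a sketch of the general p-curve logic, not our actual code – one can simulate many studies, keep only the significant p-values, and inspect their distribution; a genuine effect produces a right-skewed curve, whereas under the null the curve is flat:

```r
# Sketch of the logic behind p-curve: simulate two-sample t-tests with a
# true effect and examine the distribution of significant p-values
set.seed(2)
pvals <- replicate(10000, {
  x <- rnorm(20, mean = 0)    # group 1
  y <- rnorm(20, mean = 0.5)  # group 2: true effect of d = 0.5
  t.test(x, y)$p.value
})

# p-curve considers only significant results; a true effect piles them up
# near zero (right skew), while under the null they would be spread evenly
sig <- pvals[pvals < .05]
hist(sig, breaks = seq(0, .05, by = .01),
     main = "Simulated p-curve (true effect)", xlab = "p-value")
```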

These two examples illustrate the importance of making code open. It probably didn’t matter much when everyone was doing very simple and straightforward analyses: a t-test or correlation can easily be re-run in any package given a well-annotated dataset. But the trend in science is for analyses to get more and more complicated. I struggle to understand the methods of many current papers in neuroscience and genetics – fields where replication is sorely needed, but impossible to achieve if everyone does things differently and describes their methods only incompletely. Even in less data-intensive areas such as psycholinguistics, there has been a culture change away from reliance on ANOVAs towards much fancier multilevel modelling approaches.
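To illustrate the contrast – assuming the lme4 package, with entirely made-up variable names (rt, condition, subject, item) and simulated data – compare how little a t-test leaves unspecified with how many analyst decisions a multilevel model embodies:

```r
# Contrast sketch: simulated data with hypothetical names; assumes 'lme4'
library(lme4)

set.seed(3)
d <- expand.grid(subject = factor(1:20), item = factor(1:10),
                 condition = factor(c("a", "b")))
d$rt <- 500 + 30 * (d$condition == "b") +
  rnorm(20, sd = 25)[d$subject] +   # between-subject variability
  rnorm(nrow(d), sd = 50)           # trial-level noise

# A t-test is easy to re-run in any package from a well-annotated dataset
t.test(rt ~ condition, data = d)

# A multilevel model embodies decisions (which random slopes? crossed
# subject and item effects?) that only the analysis script records exactly
m <- lmer(rt ~ condition + (1 + condition | subject) + (1 | item), data = d)
summary(m)
```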

My experience leads me to recommend sharing of analysis code as well as data: it will help establish reproducibility of your findings, provide a training tool for others, and ensure your work is in a safe haven if you need to revisit it.

Finally, this is a further endorsement of Twitter as an academic tool. Without Twitter I wouldn't have discovered the Open Science Framework or PeerJ, both of which are great for those who want to embrace open science. And my interchange with Nick was not the end of the story. Others chipped in with helpful comments, as you can see below:


P.S. And here is another story of victory for Open Data, just yesterday, from the excellent Ed Yong.

3 comments:

  1. I release all of my code publicly, as a matter of principle. GitHub gist is really great for snippets of code and data (e.g. https://gist.github.com/richarddmorey/5f22fc742535d078c25f and https://gist.github.com/richarddmorey/862ca2681afd3cd85b3b), which can all be read directly into R (see my blog post here for an example: http://bayesfactor.blogspot.co.uk/2015/11/neyman-does-science-part-2.html). A gist can be public or private, versions are tracked, and others can comment. It's really great.
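    A minimal sketch of what "read directly into R" can look like, assuming the devtools package is installed (the gist ID is the first one linked above):

    ```r
    # source_gist() fetches and runs the R code stored in a public gist;
    # assumes the 'devtools' package; gist ID as linked in the comment above
    library(devtools)
    source_gist("5f22fc742535d078c25f")
    ```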

    In doing research for what is now the PRO Initiative, however, I learned that most Universities have a policy that retains the copyright for code. As a University employee, by law the University gets the copyright of everything you produce. Data is not copyrightable anyway, so that's not a problem; copyright for text is often reverted back to the academic to prevent any complications. Code, however, is seen as monetizable due to its use in software. Of course, the vast majority of code written by academics is analysis or simulation code, not part of any software package, and even software written by academics should often be open source. So this situation makes no sense.

    So when I release or license my code, I'm technically flouting the rules. The lawyers at Cardiff tell me there's no problem, because the policy was intended for other purposes and they would never stand in the way of open software and analysis code (which is likely true), but the fact remains that I cannot technically license the code I write. If someone else uses my code in a project and that project makes money, the University could ask for a slice. This legal greyness is something we need to fix.

    The quick rise of open source among academics has taken University intellectual property people by surprise; their policies are still geared toward an old way of doing things. I've talked a bit with Chris Chambers about trying to get the policy folks engaged in explicitly fixing this. Perhaps you'd be interested in being involved?

  2. Thanks Richard. I do need to get to grips with GitHub – this was mentioned on Twitter, and I had to explain that I had intended to but hadn't managed to get round to it.
    Re legal situation: I'm quite relaxed about flouting rules. Especially stupid rules. Happy to discuss further when we next meet up.

  3. This comment has been removed by a blog administrator.
