BishopBlog: Data sharing: Exciting but scary

Monday 26 May 2014

Data sharing: Exciting but scary

Yesterday I did something I've never done before in many years of publishing. When I submitted a revised manuscript of a research report to a journal, I also posted the dataset on the web, together with the script I'd used to extract the summary results. It was exciting. It felt as if I was part of a scientific revolution that has been gathering pace over the past two or three years, which culminated in adoption of a data policy by PLOS journals last February. This specified that authors were required to make the data underlying their scientific findings available publicly immediately upon publication of the article. As it happens, my paper is not submitted to PLOS, and so I'm not obliged to do this, but I wanted to, having considered the pros and cons. My decision was also influenced by the Wellcome Trust, who fund my work and encourage data sharing.

The benefits are potentially huge. People usually think about the value to other researchers, who may be able to extract useful information from your data, and there's no doubt this is a factor. Particularly with large datasets, it's often the case that researchers only use a subset of the data, and so valuable information is squandered and may be lost forever. More than once I've had someone ask me for an old dataset, only to find it is inaccessible, because it was stored on a floppy disk or an ancient, non-networked computer and so is no longer readable. Even if you think that you've extracted all you can from a dataset, it may still be worth preserving for potential inclusion in future meta-analyses.

Another value of open data is less often emphasised: when you share data you are forced to ensure it is accurate and properly documented. I enjoy data analysis, but I'm not naturally well-disciplined about keeping everything tidy and well-organised. I've been alarmed on occasion to return to a dataset and find I have no idea what some of the variables are, because I failed to document them properly. If I know the world at large will see my dataset then I won't want to be embarrassed by it, and so I will take more care to keep it neat and tidy with everything clearly labelled. This can only be good.

But here's the scary thing. data sharing exposes researchers to the risk of being found out to be sloppy or inaccurate. To my horror, shortly before I posted my dataset on the internet yesterday I found I'd made a mistake in the calculation of one of my variables. It was a silly error, caused by basing a computation on the wrong column of data. Fortunately, it did not have a serious effect on my paper, though I did have to go through redoing all the tables and making some changes to the text. But it seemed like pure chance that I picked up on this error – I could very easily have posted the dataset on the internet with the error still there. And it was an error that would have been detected by anyone eagle-eyed enough to look at the numbers carefully. Needless to say, I'm nervous that there may well be other errors in there that I did not pick up. But at least it's not as bad as an apocryphal case of a distinguished research group whose dramatic (and published) results arose because someone forgot to designate 9 as a missing value code. When I heard about that I shuddered, as I could see how easily it could happen.

This is why Open Data is both important for science but difficult for scientists. In the past, I've found mistakes in my datasets, but this has been a private experience. To date, as far as I am aware, no serious errors have got into my published papers – though I did have another close shave last year when I found a wrongly-reported set of means at the proofs stage, and there have been a couple of instances where minor errata have had to be published. But the one thing I've learned as I wiped the egg off my face is that error is inevitable and unavoidable, however careful you try to be. The best way to flush out these errors is to make the data public. This will inevitably lead to some embarrassment when mistakes are found, but at the end of the day, our goal must be to find out what is the case, rather than to save face.

I'm aware that not everyone agrees with me on this. There are concerns that open data sharing could lead to scientists getting scooped, will take up too much time, and could be used to impose ever more draconian regulation on beleaguered scientists: as DrugMonkey memorably put it: "Data depository obsession gets us a little closer to home because the psychotics are the Open Access Eleventy waccaloons who, presumably, started out as nice, normal, reasonable scientists." But I think this misses the point. Drug Monkey seems to think this is all about imposing regulations to prevent fraud and other dubious practices. I don't think this is so. The counter-arguments were well articulated in a blogpost by Tal Yarkoni. In brief, it's about moving to a point where it is accepted practice to make data publicly available, to improve scientific transparency, accuracy and collaboration.

10 comments:

StokesBlog27 May 2014 at 08:28
Hi Dorothy,

Great post! I completely agree that posting data online also improves one's own approach to the analysis. The simple act of preparing the data for someone else to understand is a useful debugging tool, gives you a different perspective and forces you to be a bit more organised/systematic than you might otherwise be. Russ Poldrack made a similar point recently, in a nice post dissecting a coding error he only detected when he shared his analysis scripts: http://www.russpoldrack.org/2013/02/anatomy-of-coding-error.html. Basically, error is inevitable, especially for bespoke script-based analyses, so we really do need another pair of eyes. I am always struck by how much scrutiny we put to the text of a manuscript, send it around to colleagues and co-authors for endless proof-reads, edits, corrections, but rarely show anyone else the original working out of analyses. This must be the wrong way around. Moreover, I worry that error is not random. We are far more likely to double check anomalies that contradict our hypotheses than nice publishable results (biased debugging: http://the-brain-box.blogspot.co.uk/2013/02/biased-debugging.html). Preparing data (and analysis scripts) for public scrutiny is a great way to improve the reliability of research findings.

Mark

ReplyDelete
Replies
jrkrideau27 May 2014 at 19:03
The Reinhart-Rogoff error – or how not to Excel at economics
http://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646

Richard Tol
Errors in estimates of the aggregate economic impacts of climate change
http://www.lse.ac.uk/GranthamInstitute/Media/Commentary/2014/April/Errors-in-estimates-of-the-aggregate-economic-impacts-of-climate-change-%E2%80%93-Part-II.aspx

Financial Times Finds “Many” Errors in Piketty Analysis, Argues They Undermine His Thesis
http://www.nakedcapitalism.com/2014/05/financial-times-finds-many-errors-piketty-analysis-argues-undermine-thesis.html
ReplyDelete
Replies
deevybee28 May 2014 at 09:14
thanks for your comment.Re Excel : I think it has its uses- I will often do a preliminary quick and dirty look at a dataset in Excel, not least because it is so easy to see data and plots alongside one another. I then do the serious analysis in R or SPSS, but having the Excel version provides a good double check, and I have trapped errors when different approaches give discrepant results.
I am fascinated by the current debate on Piketty - I had only been very vaguely aware of this until someone on Twitter asked if my post was inspired by the Piketty case. I can now see why - very interesting parallels in terms of error detection. I liked this account of the story, which I think is v balanced:
http://fivethirtyeight.com/features/be-skeptical-of-both-piketty-and-his-skeptics/
ReplyDelete
Replies
jrkrideau14 June 2014 at 17:22
An example of data sharing at its best: https://www.youtube.com/watch?v=N2zK3sAtr-4
ReplyDelete
Replies
Joe2 July 2014 at 13:10
Out of curiosity, where did you post the data set and R code? GitHub?
ReplyDelete
Replies

Add comment

New comments are not allowed.

BishopBlog

Monday 26 May 2014

Data sharing: Exciting but scary

10 comments:

Search This Blog

Prizewinning blog

Popular Posts

Blog Archive

Contributors

Followers

BishopBlog

Monday 26 May 2014

Data sharing: Exciting but scary

10 comments:

Search This Blog

Subscribe To

Prizewinning blog

Popular Posts

Blog Archive

Contributors

Followers