This topic presents a serious dilemma.<p>One aspect of science that doesn't get much attention in this debate is the role of the scientist as an ethical and idealistic actor; to be a scientist is (or was) to have a higher calling, to help humanity get closer to the truth. And this is crucial to science itself, because scientists need to be able to <i>trust</i> other scientists. Neither everyone-watches-everyone-style trust nor you-will-be-punished-harshly-if-caught trust works; you need I-do-it-because-I-believe-in-it trust to make science work.<p>Now, the more that graduate students are made disposable, the more that professors live in a ruthless, sink-or-swim environment, and so forth, the less likely a scientist is to remain an idealist interested first and foremost in discovering the truth, and the less that crucial element of trust will remain.<p>The latest fad is "outsourcing science". If we want to make science less broken, it seems like we should be going in the opposite direction.
I'm not in a medical field, but the problem likely exists in our discipline as well.<p>The issue, I suspect, stems from the nature of publishing: top-tier journals only publish "interesting" research, which means reproduction work is less welcome, and when it is done, it needs to be accompanied by a serious value-add of its own.<p>There is no incentive to reproduce. It makes it harder to publish. It doesn't lead to tenure. Why bother?
Coming from a computationally intensive discipline in academia, it is astounding how difficult it can be for researchers to reproduce their own results. The tendency is to write just enough code to generate an impressive diagram for a journal illustration or presentation slide and move on. It's common not to know which date or version of a constantly shifting public data set the original result was generated from, or even where the scripts are located six months down the road. I tied myself in knots trying to iron out data bugs and irregularities, which ultimately forced me to dump a year of research and recreate the entire upstream data pipeline in my lab.<p>In another example, a very promising cancer drug prediction algorithm (with fascinating in vitro results tested by an affiliated lab) was abandoned because of a key researcher's untimely death and the complete lack of version control anywhere in the lab. The paper had already been published (thankfully), but we literally had no idea where the code and the intermediate data were. We had a ~5,000-node GPFS cluster with rolling backups, but it didn't help at all because all the development was done locally; the situation was the same across the lab. The PI's decision in the wake of this compound tragedy was to have lab members pair up, "cross-train" each other for an hour, and verbally tell each other where they kept their important data.<p>Returning to the corrupted data I personally experienced: I unfortunately discovered the problem the night before a multi-departmental research presentation. There were numerous reversed edges in a large digraph, caused by improper integration of two data sets before my involvement (I was also at fault for trusting internal data). I told the PI about it in the morning, since the problem ran so deep, and said I couldn't present anything because every single result of the past year was invalidated by the bug I had found. His response: present anyway. I refused. That did not go over well.<p>I'd like to see every computational paper (especially in biology, where these methods end up influencing human clinical medicine) include all source code in a public repository, but it isn't going to happen. Labs would lose their edge if they had to tell competitors what model weights they had iterated to in creating their newest prediction algorithms, and university technology transfer departments would have greater difficulty patenting these methods and selling them to drug companies. The current model will not change, but a new one might supplant it.<p>I wasn't on the cancer drug prediction project, but I probably know enough about it to reconstruct it. It actually seems like a great candidate for an open source project.
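To make the data-set versioning failure concrete: here's a minimal sketch, in Python with only the standard library, of recording the source, hash, and retrieval date of a public data set before using it. The function name and manifest layout are my own invention, not anything our lab actually ran:

    import hashlib
    import json
    import pathlib
    from datetime import date

    def record_provenance(data_path, source_url, manifest="provenance.json"):
        # Fingerprint the exact bytes so the data set version is unambiguous.
        digest = hashlib.sha256(pathlib.Path(data_path).read_bytes()).hexdigest()
        entry = {
            "file": str(data_path),
            "source_url": source_url,
            "sha256": digest,
            "retrieved": date.today().isoformat(),
        }
        # Append to a JSON manifest kept alongside the results.
        path = pathlib.Path(manifest)
        records = json.loads(path.read_text()) if path.exists() else []
        records.append(entry)
        path.write_text(json.dumps(records, indent=2))

Checking that manifest in next to each figure's script would have answered "which version of the data produced this diagram?" a year later.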
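And the reversed-edge bug could have been caught with an equally small check at integration time. This sketch is hypothetical too, and assumes edges are stored as (source, target) tuples; it flags any edge that the two sources record in opposite orientations:

    def conflicting_edges(edges_a, edges_b):
        # An edge present one way in A and the opposite way in B means
        # the two sources disagree on direction; don't merge them blindly.
        flipped_b = {(v, u) for (u, v) in edges_b}
        return set(edges_a) & flipped_b

    # e.g. one source says g1 regulates g2, the other says the reverse:
    print(conflicting_edges([("g1", "g2"), ("g3", "g4")],
                            [("g2", "g1"), ("g3", "g4")]))
    # -> {('g1', 'g2')}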
Good analysis of providing source code and data sets, and of the potential burdens on reviewers and authors.<p><a href="http://nlpers.blogspot.com/2011/03/some-thoughts-on-supplementary.html" rel="nofollow">http://nlpers.blogspot.com/2011/03/some-thoughts-on-suppleme...</a>