There's certainly a kernel of truth to this.<p>I worked in a cancer lab as an undergrad. New PI had just come over to the school. He had us working on a protocol he'd developed in his last lab to culture a specific type of glandular tissue in a specific way.<p>I and two other students spent months trying to recreate his results, at his behest. When we couldn't get it to work, he'd shrug and say some shit like "keep plugging away at the variables," or "you didn't pipette it right." I don't even know what the fuck the second one means.<p>Then an experienced grad student joined the lab. He spent, I don't know, a week at it? and was successfully reproducing the culture.<p>I still don't know how he did it. I just know that he wasn't carrying a magic wand, and the PI certainly wasn't perpetrating a fraud <i>against himself</i>. It was just, I guess, experience and skill.
As an outsider to academia, I see this as a long list of excuses, and it saddens me. A huge number of hours are wasted every year by undergrads fighting with incomplete papers when we could do better. We could enforce higher standards. Everything in a paper should be explainable and reproducible. Have a look at the efforts of PapersWithCode, Arxiv Sanity, colah's blog, or 3blue1brown in curating content or explaining concepts. I couldn't find a single excuse in this blog post for which we couldn't come up with a solution, if we had the consensus to enforce it.
Large parts of this post are naked apologism for grave aberrations in computer science research: the equivalent of code smells that have calcified and are now "how we've always done things" because nobody has the courage to do anything to fix them, or even knows how. That these are addressed to new PhDs as "this is the real world and you must get used to it" is just tragic. The surest way for all this garbage to remain as it is would be for new PhDs to accept it and not try to change anything.
To be fair, in my field (deep learning, computer vision), papers often do not contain enough information to reproduce the results. To take a recent example, Google's EfficientDet paper did not contain enough detail to implement BiFPN, so nobody could replicate their results until the official implementation was released. And even then, to the best of my knowledge, nobody has been able to train the models to the same accuracy in PyTorch - the results matching Google's merely port the TensorFlow weights.<p>Much of the recent "efficient" DL work is like that. Efficient models are notoriously difficult to train, and all manner of secret sauce simply goes unmentioned; without it you won't get the same result. At higher levels of precision, a single percentage point of a metric can mean a 10% increase in error rate, so this is not negligible.<p>To the authors' credit, though, a lot of this work does get released in full source-code form, so even if you can't achieve the same result on your own hardware, you can at least test the results using the provided weights and see that they _are_ in fact achievable.
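That "check the released weights" step can be as simple as the sketch below. It uses a torchvision classifier as a stand-in rather than EfficientDet itself, and the validation-folder path is a placeholder; the point is just to compute a number on your own machine and compare it against the one in the paper.

    # Stand-in sketch (not EfficientDet): load the authors' released weights and
    # check the reported metric yourself. "path/to/imagenet/val" is a placeholder
    # for an ImageNet-style validation folder.
    import torch
    import torchvision
    from torchvision import transforms

    model = torchvision.models.resnet50(pretrained=True)  # newer torchvision uses weights=...
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    val_set = torchvision.datasets.ImageFolder("path/to/imagenet/val", preprocess)
    loader = torch.utils.data.DataLoader(val_set, batch_size=64, num_workers=4)

    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    print(f"top-1 accuracy: {correct / total:.4f}")  # compare against the published number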
My personal experience with trying to recreate tech from research papers (real-time mobile computer vision in my case):<p>- papers assumed highly specialized math skills (e.g. manifolds), which I didn't have at the time, so testing and bug hunting were harder<p>- code was missing, or it was a long C file vomited out by Matlab, with some pieces switched to (desktop) SSE instructions for speed gains<p>- papers were missing vital parameters to reproduce the experiment (e.g. the exact setting of a variable that trades optimization-loop precision against speed - see the toy sketch below)<p>- the experiment was very constrained and the whole algorithm would never have worked in real life, which is what I had suspected for the first few months (meh)<p>- most papers, as the article says, are just a little bump over the previous ones, so now you have to read and implement a whole tree of papers<p>- sometimes a dataset was needed to train a model, but the dataset was closed and incredibly expensive, so that was not a viable avenue<p>At the time I was also working with the first version of (Apple) Metal which, after I went crazy trying to figure out why my algorithm wasn't working, I discovered had a precision bug in the division operator. FML<p>Still, it was a very instructive experience. The biggest takeaway, if you do something similar: don't assume that once you've implemented an algorithm, it will work as advertised. It's totally different from, say, writing an API; it's not a well-constrained problem.
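To make the "missing vital parameters" point concrete, here is a toy Python sketch (not taken from any of the papers I worked with): the only thing that changes between runs is the stopping tolerance of an optimization loop, which is exactly the kind of knob that often goes unreported.

    # Toy illustration: the only difference between these runs is the stopping
    # tolerance, yet the number you would end up reporting changes noticeably.
    def gradient_descent(tol, lr=0.1, max_iter=100_000):
        """Minimize f(x) = (x - 3.14159)**2, stopping when |gradient| < tol."""
        x = 0.0
        for i in range(max_iter):
            grad = 2.0 * (x - 3.14159)
            if abs(grad) < tol:  # the "vital parameter" the paper left out
                return x, i
            x -= lr * grad
        return x, max_iter

    for tol in (1e-1, 1e-3, 1e-8):
        x, iters = gradient_descent(tol)
        print(f"tol={tol:g}: x={x:.6f} after {iters} iterations")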
Hey, DVC maintainer here. For those who are interested in this topic, I like this post about the same problem (industry-focused) - <a href="https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/" rel="nofollow">https://petewarden.com/2018/03/19/the-machine-learning-repro...</a> - and an excellent talk from Patrick Ball on how they structure data projects - <a href="https://www.youtube.com/watch?v=ZSunU9GQdcI&t=1s" rel="nofollow">https://www.youtube.com/watch?v=ZSunU9GQdcI&t=1s</a>.
It is saddening to see so much pessimism in this thread.<p>Tight regulations on reproducibility are the first thing academia needs. Academia is a rat race with so much dishonesty and optimizing for metrics (citation count, etc.) these days. I have seen too many professors producing low-quality papers for the sake of producing papers, which creates a lot of noise and a cargo-cult PhD culture. Without reproducibility, how can you even trust the results?<p>While academic code doesn't adhere to the code quality standards of software development, I don't think many software engineers would ridicule an author for publishing their code, let alone other academics.<p>Btw, I submitted a link (a very comprehensive article by a CS academic) but it was lost in the HN noise. Maybe someone with higher karma can repost it? I found the article through the nofreeview-noreview manifesto website; it's very well written and covers a number of problems like this (<a href="http://a3nm.net/work/research/wrong" rel="nofollow">http://a3nm.net/work/research/wrong</a>). PS: I am in no way affiliated with the author; I mention it only because the article deserves to be on the HN front page.
Even academic papers with open-source code can be infuriating to get working. Usually it's a mess of hidden dependencies, specific versions of global libraries, hard-coded paths, undocumented compiler settings, specific OS versions...<p>All too often, the author's highly specific experience and knowledge is assumed of the reader...
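One cheap thing that would help is shipping a snapshot of the environment the results actually came from. A rough sketch, using only the standard library plus pip and git (nothing here is specific to any particular paper):

    # Snapshot the environment a result was produced in, so readers at least
    # know which Python, OS, commit, and package versions were involved.
    import json
    import platform
    import subprocess
    import sys

    def run(cmd):
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, check=True)
            return out.stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            return "unavailable"

    snapshot = {
        "python": sys.version,
        "os": platform.platform(),
        "git_commit": run(["git", "rev-parse", "HEAD"]),
        "packages": run([sys.executable, "-m", "pip", "freeze"]).splitlines(),
    }

    with open("environment_snapshot.json", "w") as f:
        json.dump(snapshot, f, indent=2)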
A bit of a plug, but this is exactly the reason why we are building Nextjournal [0].<p>We've built the platform from the ground up with immutability in mind, leveraging Clojure and Datomic, which are a great fit for this architecture.<p>[0]: <a href="https://nextjournal.com" rel="nofollow">https://nextjournal.com</a>
> To describe the implementation in a way which is less precise, but simpler, shorter, and easier for the reader to understand.<p>I'm waiting for the textbook that offers formulae with code and a bit of regression data.
Just as a neat bit of trivia that isn't mentioned in the article, the inventor of the equals sign was Robert Recorde[1]. Dijkstra provides some additional background at [2].<p>[1] <a href="https://en.wikipedia.org/wiki/Robert_Recorde" rel="nofollow">https://en.wikipedia.org/wiki/Robert_Recorde</a><p>[2] <a href="https://www.cs.utexas.edu/users/EWD/transcriptions/EWD10xx/EWD1073.html" rel="nofollow">https://www.cs.utexas.edu/users/EWD/transcriptions/EWD10xx/E...</a>
I do wonder whether standardized equipment with digital controls (or digitally monitored controls) that record all salient values throughout an experiment would 'solve' this.<p>Obviously some cutting-edge work can't use standardized equipment, but can you standardize a lot of other stuff?
In the case of AI: just train again. Good or bad luck with the random weight initialization can have a huge influence on your results. Nobody really talks about it, but many "pro" papers use a global seed and deterministic randomness to avoid the issue.
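For PyTorch, that usually looks something like the sketch below. Treat it as an approximation: the exact flags vary by version, and some GPU ops stay nondeterministic regardless.

    # What "global seed and deterministic randomness" typically means in PyTorch.
    import random
    import numpy as np
    import torch

    def seed_everything(seed: int = 42):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # trade cuDNN speed for run-to-run reproducibility
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    seed_everything(42)
    # ...build and train as usual; reruns now match, but the variance across
    # different seeds (the thing reviewers rarely see) stays hidden.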
> I'm about to tell you something which can sometimes be harder to believe than conspiracy theories about academia: you've got a bug in your code.<p>If this is the case in a majority of the instances where someone fails to reproduce the software backing a paper, then there may be another issue at play. Someone re-implementing a paper is "just" taking a written description of something and transcribing it to code. If a mistake that can completely ruin the results of the work can be made this easily, it should be fairly straightforward to see that the original implementer could also have made a mistake that threw off their results.<p>Does the world of academia have any tools to prevent this? It seems like this could affect a lot of the research being done today. Given the following:<p>1. Most research being done today uses some software to generate its results.<p>2. This software often encodes some novel way of implementing an analysis method.<p>3. The published paper from research with a novel analysis method will need to describe that method.<p>4. Future researchers will find this paper, attempt to implement the described analysis method, and publish new results from this implementation.<p>We can see we are wasting a lot of resources reimplementing already-written code. We may also not be implementing this code correctly, and may be skewing results in different ways.<p>> Debugging research code is extremely difficult, and requires a different mind set - the attention to detail you need to adopt will be beyond anything you've done before. With research code, and numerical or data-driven code in particular, bugs will not manifest themselves in crashes. Often bugged code will not only run, but produce some kind of broken result. It is up to you to validate that every line of code you write is correct. No one is going to do that for you. And yes, sometimes that means doing things the hard way: examining data by hand, and staring at lines of code one by one until you spot what is wrong.<p>This is a very common mindset among academics, but I don't understand why it has to be the case. A paper seems like a fantastic opportunity to define an interface boundary. If your paper describes a new method to take A and transform it into B, then it should be possible for you to write and publish your `a_to_b` method alongside your paper. You could even write unit tests for your `a_to_b` (see the toy sketch below). If a future researcher comes along and finds a way to apply your `a_to_b` to more things, they could modify your code and, just by rerunning your tests, verify that their implementation actually works.<p>If a future researcher decided to use your `a_to_b`, you could write some code to automatically generate a list of papers to reference.<p>If we are seriously spending this much time treading the same water, then it should be possible to dramatically improve the quality and throughput of academia by providing a tool like this.<p>I know someone will say "but you gain so much knowledge re-implementing XYZ", and to them I'd say that you don't need to read the code while reading the paper. You could write the code yourself and just use the unit tests provided by the author to make sure you fully understand each edge case.
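As a toy illustration of what publishing `a_to_b` next to the paper could look like: the transform here (min-max scaling) is only a stand-in for whatever method a real paper would describe.

    # Toy sketch of an `a_to_b` shipped alongside a paper, with its tests.
    import numpy as np

    def a_to_b(a: np.ndarray) -> np.ndarray:
        """Map raw measurements A onto [0, 1] (placeholder for the paper's method)."""
        lo, hi = a.min(), a.max()
        if hi == lo:  # the edge case the prose might never mention
            return np.zeros_like(a, dtype=float)
        return (a - lo) / (hi - lo)

    def test_a_to_b_range():
        b = a_to_b(np.array([3.0, 7.0, 11.0]))
        assert b.min() == 0.0 and b.max() == 1.0

    def test_a_to_b_constant_input():
        assert np.all(a_to_b(np.array([5.0, 5.0])) == 0.0)

    if __name__ == "__main__":
        test_a_to_b_range()
        test_a_to_b_constant_input()
        print("a_to_b behaves as described")

A future researcher who extends or swaps out `a_to_b` only has to rerun the tests to know whether they still match the described behaviour.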
Let me make a statement and let you judge it. (It's below, marked "STATEMENT".)<p>BACKGROUND
I have been working in industry for 15 years doing hardcore ML. I was fortunate to have the drive and background to get my master's degree from an R1 school while working full time. No watered-down online degree, no certificate. I drove to class twice a week for 4 years and did a full thesis, which was published. Since then, I have published 6 papers, all peer reviewed. I even did a sabbatical with another research lab that invited me to come.<p>After 15 years, I decided to go back and get my PhD, all while continuing to work full time. My thought was that it would be easy to get a PhD with all the technical experience and math chops I've developed over the last 15 years. I have essentially been doing math 5 days a week for 15 years. Here's what happened...<p>Coursework was a breeze. I barely put any time into it and can easily get a B+. This is really helpful because I am working 40-50 hours a week at my full-time job and managing my family. I passed my candidacy exam on the first try with little issue (this is rare for my department).<p>The biggest hangup I have about the PhD process is what my advisor wants me to do when writing papers. He is the youngest full professor in the department and comes from a well-known and well-respected graduate research university. But the way he has me slant my papers is absurd. Results that I feel are very important for readers to assess whether they should use the method, he has me remove because they are "too subtle." He is constantly hammering me to think about "the casual reviewer."<p>Students in the lab produce papers that are very brittle and overfit to the test data. His lab uses the same dataset paper after paper. My advisor was so proud of a method his top student produced that he offered the code for my workplace to use. It didn't work as well as a much simpler method we used. Eventually we gave the student our data so his method could get the fairest possible shake. The student never got the method to work nearly as well as in his published paper, despite telling my company and me over and over that "it will work". The student is now at Amazon Lab126.<p>STATEMENT: Academia is driven by peer review, but the peers are other academics, and so the system of innovation is dead; academics have very little understanding of what actually works in practice. Great example: it's no surprise that Google has such a hard time using ML on MRI datasets. The groups working on this are made up of PhDs from my grad lab!<p>TL;DR - worked for 15 years, went back for a PhD, here's what I hear:<p>"think of the casual reviewer"<p>"fiddle with your net so that your results are better than X"<p>"you have to sell your results so that it's clear your method has merit"<p>"can you get me results that are 1% better? use the tricks from blog Y"<p>"As long as your results are 1% better, you are fine"<p>Edit 1: the boasts above are there to head off "your experience doesn't count because you X" straw men, where X = {are lazy, are a young student, are inexperienced, went to an easy school, are in an easy program, are naive to the peer review process}