Why scientific programming does not compute

192 points by szany, almost 14 years ago

28 comments

dasil003, almost 14 years ago

For all the talk of "best practices" and "training", the depressing truth is that guaranteeing correct software is incredibly difficult and expensive. Professional software engineering practices aren't nearly sufficient to guarantee correctness with heavy math. The closest thing we have is NASA, where the entire development process is designed and constantly refined in response to individual issues, creating checks and balances with the lofty goal of approaching bug impossibility at an organizational level. Unfortunately this type of evolutionary process is only viable for multi-year projects with 9-figure budgets. It's not going to work for the vast majority of research scientists with limited organizational support.

On the positive side, such difficulty is also in the nature of science itself. Scientists already understand that rigorous peer review is the only way to come to reliable scientific conclusions over time. The only thing they need help understanding is that the software used to reach these conclusions is as suspect as, if not more so than, the scientific data collection and reasoning itself, and therefore all software must be peer-reviewed as well. This needs to be ingrained culturally into the scientific establishment. In doing so, scientists can begin to attack the problem from the correct perspective, rather than having industry software experts come in and feed them a bunch of cargo cult "unit tests" and "best practices" that are no substitute for deep reasoning in the specific domain in question.

gallamine, almost 14 years ago

I'm a PhD student in Electrical Engineering. I'm currently working on a Monte Carlo-type simulation for looking at the underwater light field for underwater optical communication (no sharks!). I'm doing the development in MATLAB and I recently put all my code up on Github (https://github.com/gallamine/Photonator) to help avoid some of these problems (lack of transparency). Even if nobody ever looks at or uses the code, I know every time I do a commit there's a chance someone MIGHT, and I think it helps me write better code.

The problem with doing science via models/simulation is that there just isn't a good way of knowing when it's "right" (well, at least in a lot of cases), so testing and verification are imperative. I can't tell you how many times I've lain awake at night wondering if my code has a bug in it that I can't find and that will taint my research results.

I suspect another big problem is that one student writes the code, graduates, then leaves it to future students, or worse, their professor, to figure out what they wrote. Passing on the knowledge takes a heck of a lot of time, especially when you're pressed to graduate and get a paycheck.

There's got to be a market in this somewhere. Even if it was just a volunteer service of "real" programmers who would help scientists out. I spent weeks trying to get my code running on AWS, which probably would have taken a few hours for someone who knew what they were doing. I also suspect that someone with practice could make my simulations run at twice the speed, which really adds up when you're doing hundreds of them and they take hours each.

gte910h, almost 14 years ago

I want to write a "software style guide" for journalists and their editors.

"Software" and "code" are both mass nouns in technical language.

"Code" can be in programs (aka things that run), libraries (things that other programmers can use to make programs), or samples that show people how to do things in their programs or libraries. Some people call short programs scripts.

When you feel you should pluralize "software", you're doing something wrong. You might want to use the word "programs", you might want to use the word "products", or you might want to just use it like a mass noun. Where you would write "It turns out thieves broke into the facility and stole some of the water", write "It turns out thieves broke into the facility and stole some of the software" when talking about a theft of software.

jzila, almost 14 years ago

My girlfriend is a PhD student in a pharmacology lab. I'm a software engineer working for an industry leader.

Once, she and the lab tech were having issues with their analysis program for a set of data. It was producing errors randomly for certain inputs, and the data "looked wrong" when it didn't throw an error. I came with her to the lab on a Saturday and looked through the spaghetti code for about 20 minutes. Once I understood what they were trying to do, I noticed that they had forgotten to transpose a matrix at one spot. A simple call to a transposition function fixed everything.

If this had been an issue that wasn't throwing errors, I don't know whether they would have even found the bug. I've been trying to teach my gf a basic understanding of software development from the ground up, and she's getting a lot better. But this does appear to be a systemic problem within the scientific community. As the article notes, more and more complicated programs are needed to perform more detailed analysis than ever before. This problem isn't going to go away, so it's important that scientists realize the shortcoming and take steps to curb it.
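
The comment does not show the lab's code, so the following is only a minimal sketch of the general failure mode, with made-up data and hypothetical helper names (`shape`, `transpose`, `matmul`). It assumes matrices stored as lists of rows and shows how an explicit shape check turns a missing transpose into an immediate, readable error.

```python
def shape(matrix):
    """Return (rows, cols) of a list-of-lists matrix, rejecting ragged rows."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    if any(len(row) != cols for row in matrix):
        raise ValueError("ragged matrix: rows have different lengths")
    return rows, cols


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices, failing loudly when the inner dimensions disagree."""
    (a_rows, a_cols), (b_rows, b_cols) = shape(a), shape(b)
    if a_cols != b_rows:
        raise ValueError(
            f"shape mismatch: {a_rows}x{a_cols} times {b_rows}x{b_cols}; "
            "is a transpose missing?"
        )
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical data: 3 samples with 2 measurements each, and 3 per-sample weights.
samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3x2
weights = [[0.5, 0.25, 0.25]]                   # 1x3

weighted = matmul(weights, samples)             # 1x3 times 3x2 gives 1x2
```

Calling `matmul(samples, weights)` instead raises a `ValueError` that names both shapes. A check like this only helps when the matrices are not square; with square inputs a missing transpose runs without complaint, which is the silent case the comment worries about.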

GoogleMeElmo, almost 14 years ago

Yes, this is a huge problem. I am a software engineer working at a research institute for bioinformatics. The biggest problem I encounter in my struggle for clean, maintainable code is that management deprioritizes this task quite heavily.

The researchers produce code of questionable quality that needs to go into the main branch asap. Those few researchers who know how to code (we do a lot of image analysis) don't know anything about keeping it maintainable. There is almost a hostile stance against doing things right when it comes to best practices.

The "Works on my computer" seal of approval has taken on a whole new meaning for me. Things go from prototype to production on a single correct run on a single data set. Sometimes it's so bad I don't know if I should laugh or cry.

Since we don't have a single test, or ever take the time to set up a proper build system, my job mostly consists of droning through repetitive tasks and bug hunting. It sucks the life right out of any self-respecting developer.

There, I needed that. Feel free to flame my little rant down into the abyss. :)

bh42222, almost 14 years ago

> As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists.

Just stop doing that!

Seriously, testing is not wasted effort, and for any project that's large enough it's not slowing you down. For a very small and simple project testing might slow you down; for bigger things, testing makes you faster! And the same goes for documentation. And full source code should be part of every paper.

> Many programmers in industry are also trained to annotate their code clearly, so that others can understand its function and easily build on it.

No, you document code primarily so YOU can understand it yourself. Debugging is twice as hard as coding, so if you're just smart enough to code it, you have no hope of debugging it.

notarealname, almost 14 years ago

[New account for anonymity]

An often neglected force in this argument is that many practitioners of "scientific coding" take rapid iteration to its illogical and deleterious conclusion.

I'm often lightly chastised for my tendency to write maintainable, documented, reusable code. People laugh guiltily when I ask them to try checking out an svn repository, let alone cloning a git repo. It's certain that in my field (ECE and CS) some people are very adamant about clean coding conventions, and we're definitely able to make an impact by bringing people to use more high-level languages and better documentation practices.

But that doesn't mean an hour goes by without seeing results reverse due to a bug buried deep in 10k lines of undocumented C or Perl or MATLAB full of single-letter variables and negligible modularity.

brohee, almost 14 years ago

Next they'll discover that when those scientists leave academia and become quants, they don't magically become any better at coding (but at least they now have access to professionals, if they recognize the need).

gwern, almost 14 years ago

An interesting citation, http://portal.acm.org/citation.cfm?id=188228:

> This paper describes some results of what, to the authors' knowledge, is the largest N-version programming experiment ever performed. The object of this ongoing four-year study is to attempt to determine just how consistent the results of scientific computation really are, and, from this, to estimate accuracy. The experiment is being carried out in a branch of the earth sciences known as seismic data processing, where 15 or so independently developed large commercial packages that implement mathematical algorithms from the same or similar published specifications in the same programming language (Fortran) have been developed over the last 20 years. The results of processing the same input dataset, using the same user-specified parameters, for nine of these packages is reported in this paper. Finally, feedback of obvious flaws was attempted to reduce the overall disagreement. The results are deeply disturbing. Whereas scientists like to think that their code is accurate to the precision of the arithmetic used, in this study, numerical disagreement grows at around the rate of 1% in average absolute difference per 4000 lines of implemented code, and, even worse, the nature of the disagreement is nonrandom. Furthermore, the seismic data processing industry has better than average quality standards for its software development with both identifiable quality assurance functions and substantial test datasets.
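
To make the idea of N-version comparison concrete, here is a rough sketch of how disagreement between independently written implementations might be quantified. The metric (mean absolute difference, scaled by the larger magnitude of each pair of values) is an assumption chosen for illustration and is not necessarily the one used in the cited paper; the package names and numbers are invented.

```python
from itertools import combinations


def mean_abs_relative_difference(a, b):
    """Average of |a_i - b_i| scaled by the larger magnitude of each pair."""
    if len(a) != len(b) or not a:
        raise ValueError("outputs must be non-empty and the same length")
    total = 0.0
    for x, y in zip(a, b):
        scale = max(abs(x), abs(y))
        # Two exact zeros agree perfectly, so they contribute nothing.
        total += abs(x - y) / scale if scale else 0.0
    return total / len(a)


def pairwise_disagreement(outputs):
    """Map each pair of package names to its mean absolute relative difference."""
    return {
        (name_a, name_b): mean_abs_relative_difference(outputs[name_a], outputs[name_b])
        for name_a, name_b in combinations(sorted(outputs), 2)
    }


# Made-up outputs from three hypothetical packages run on the same input.
outputs = {
    "package_a": [1.00, 2.00, 4.00],
    "package_b": [1.00, 2.00, 4.00],
    "package_c": [1.00, 2.50, 5.00],
}

disagreement = pairwise_disagreement(outputs)
for pair, value in disagreement.items():
    print(pair, f"{value:.3f}")
```

Agreement between versions does not prove any of them correct, since they can share the same mistake, but a large disagreement is a cheap signal that at least one of them is wrong.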

saulrh, almost 14 years ago

Something I heard from one of my professors once: "A programmer alone has a good chance of getting a good job. A scientist alone has a good chance of getting a good job. A scientist that can program, or a programmer that can do science, is the most valuable person in the building."

ajdecon, almost 14 years ago

(Disclaimer: my background is in materials physics, and it may be different in other fields. But I doubt it.)

Unfortunately there is very little direct incentive for research scientists to write or publish clean, readable code:

- There are no direct rewards, in the tenure process or otherwise, for publishing code and having it used by other scientists. Occasionally code which is widely used will add a little to the prestige of an already-eminent scientist, but even then it rarely matters much.
- Time spent on anything other than direct research or publication is seen as wasted time, and actively selected against. This is especially true for young scientists trying to make tenure, who are also the group most likely to write good code. Many departments actually discourage time spent on teaching, and they're paid to do that. Why would they maintain a codebase?
- Most scientific code is written in response to specific problems, usually a body of data or a particular system to be simulated. Because of this, code is often written for the specific problem with little regard for generality, and only rarely re-used. (This leads to lots of wheel re-invention, but it's still done this way.) If you aren't going to re-use your code, why would others?
- If by some miracle a researcher produces code which is high-quality and general enough to be used by others, the competitive atmosphere may cause them to want to keep it to themselves. Not as bad a problem in some fields, but I hear biology can be especially bad here.
- Most importantly, the software is not the goal. The goal is a better understanding of some natural phenomenon, and a publication. (Or in reverse order...) Why spend more time than absolutely necessary on a single part of the process, especially one that's not in your expertise? And why spend 3x-5x the cost of a research student or postdoc to hire a software developer at competitive rates?

I went to grad school in materials science at an R1 institution which was always ranked at 2 or 3 in my field. I wrote a lot of code, mostly image-processing routines for analyzing microscope images. Despite it being essential to understanding my data, the software component of my work was always regarded by my advisor and peers as the least important, most annoying part of the process. Time spent on writing code was seen as wasted, or at best a necessary evil. And it would never be published, so why spend even more time to "make it pretty"?

I'm honestly not sure what could be done to improve this. Journals could require that code be submitted with the paper, but I really doubt they'd be motivated to directly enforce any standards, and I have no faith in scientists being embarrassed by bad code. Anything not in the paper itself is usually of secondary importance. (Seriously, if you can, check out how bad the "Supplementary Information" on some papers is.) But even making bad code available could help... I guess. And institutions could try to more directly reward time put into publishing good code, but without the journals on board it may be seen as just another form of "outreach"--i.e., time you should have been in lab.

I did publish some code, and exactly two people have contacted me about it. That does make me happy. But many, many more people have contacted me to ask about how I solved some problem in lab, or what I'm working on now that they could connect with. (And they are always disappointed when I tell them I left the field, and now work in high-performance computing.) Based on the feedback of my peers... well, on what do you think I should've spent my time?

arctangent, almost 14 years ago

I think it is unreasonable to expect that a person will be a good programmer just because (a) they are a scientist and (b) their current project can be assisted by computers.

Is it not sensible, perhaps, to have a dedicated group of programmers (with various specialities) available as a central resource to assist the scientists with their modelling? (I am imagining a central pool whose budget would be spread over several areas.)

I personally love working on toy projects related to science. Maybe we hackers with time for that kind of thing should volunteer in some way to assist with the technical aspects of research that is directed by a scientist? I'm not sure I'd even care about getting a credit on a research paper so long as I could post pretty pictures and graphs on my blog...

ANH, almost 14 years ago

From personal experience, I attest that it can be more difficult than pulling teeth to get a scientist to commit code to a version control system.

scott_s, almost 14 years ago

One of the main sources in the article is a study from the 2009 Workshop on Software Engineering for Computational Science and Engineering. One of the workshop's organizers has written an interesting report on the workshop as a whole: http://cs.ua.edu/~carver/Papers/Journal/2009/2009_CiSE.pdf

mclin, almost 14 years ago

Rather than building these data analysis/visualization programs from scratch each time, my thought is that scientists should instead be writing them as modules for a data workflow application like RapidMiner.

If you haven't heard of RapidMiner, you basically edit a flowchart where each step takes inputs and produces outputs, e.g. take some data and make a histogram, or perform a clustering analysis.

Video of someone demoing it: http://www.youtube.com/watch?v=TNESlvXp47E

This way, the scientists can focus on the algorithms and not have to worry about all the other details of creating usable, maintainable software.

gwern, almost 14 years ago

There are a lot of suggestions that publishing the code and data be required.

Sorry guys, but that hasn't worked so far: the economics journal _Journal of Money, Credit and Banking_, which required researchers to provide the data & software needed to replicate their statistical analyses, discovered that <10% of the submitted materials were adequate for repeating the paper (see "Lessons from the JMCB Archive", Volume 38, Number 4, June 2006).

Oops.

sliverstorm, almost 14 years ago

Why not just hire comp scientists or programmers permanently? Adjust the company model, permanently segregate the work?

snissn, almost 14 years ago

There aren't nearly enough open source academic projects, nor is there any sort of pervasive culture that encourages them. Despite the litany of examples that could be put together to show that open source + academia does exist and does work, I've read way too many computational physics or computational chemistry or computational-anything academic papers that simply do not publish source code, and imo there's no good excuse for it other than the usual: funding, or copyright / university IP.

rflrob, almost 14 years ago

Where do most programmers get this exposure to best practices like version control, unit testing, etc.? I took a few early-to-mid-level CS classes, and while there was a relatively cursory emphasis on readable code, there was barely any on the sorts of things that lead to well-maintained projects. If these are the sorts of things that one learns at one's first internship, then it's no wonder that academics in other disciplines don't have any exposure to them.

jleyank, almost 14 years ago

This is a difficult situation. Is it easier to train the domain experts to be competent programmers or train the competent programmers to be domain experts? In a research environment, I worry there's little time or interest in developing specs that can change in an instant or can't be written until the physics is understood.

We find it quite difficult trying to get programming out of people who don't know why carbon has 4 bonds while nitrogen has 3, for example.

radarsat1, almost 14 years ago

I think there are multiple reasons for this problem, and only one of them is a lack of training in software management. Another problem is that science is an inherently exploratory procedure. You design an experiment, gather some data, and then go about analyzing it. You have an idea of what you'll find, but depending on what you get, you might need to then reformat/restructure the data, transform it, cut it up, etc.

The problem is that this represents one of the worst problem cases in software design: evolving requirements. By itself this is bad enough. I have recently been analysing data from a study. You start off with a data structure that you think represents things, but then you notice, for example, that you need to synchronize several recordings; now you have to track time. You realize some recordings need to be split down the middle to aid in synchronization; now you need to add a 'part' field. You derive some value from several data points that takes a long time to compute, so you need to create a file to hold it. This needs to be kept in sync with the original data. Eventually you realize that text files aren't going to cut it; you start moving things to a database. Now you need to reconfigure your visualization program to read from the database. Then you realize that you want to add another similar derived value, but this time it's a 3x3 matrix for each data point; time to extend the database again. Etc., etc. Eventually you decide it would be best to really rewrite the codebase because it's becoming impossible to work with. Unfortunately the paper is due soon and you just need to generate a few more graphs...

And I didn't even mention the growing directory of scripts that aren't properly organized into modules, that end up with copy-pasted code because it's not very clear how to cleanly put this into a function, or which module it should belong to.

Now, this is bad enough when you have a CS degree and have designed several software frameworks in your life. Combine this with someone who knows nothing about software architecture and you have a really big problem on your hands. My point is this: it happens to the best of us, no matter how hard you try to organize things, when you don't have the requirements available ahead of time.

The best approach I've found is to force myself to simply write functions as small as possible, each doing one simple thing at a time. I try to break up functions as much as possible for reuse, and avoid copy-pasting code at all costs. Admittedly it's not always easy; sometimes a function that generates a particular graph just needs a certain number of lines of logic, and it's very difficult to modularize. Then you find that you want a similar graph but with a slightly different transformation on the Y axis... etc., etc.
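
One way to handle the "similar graph with a different Y transformation" case without copy-pasting is to pass the transformation in as an argument. The sketch below is only an illustration of that idea: `build_series`, `identity`, and the sample recording are hypothetical names and data, and the actual plotting call is left out because the comment does not say which plotting tool is used.

```python
import math


def identity(value):
    """Default transform: leave the value unchanged."""
    return value


def build_series(points, y_transform=identity):
    """Return (xs, ys) ready for plotting, applying y_transform to each y value."""
    xs = [x for x, _ in points]
    ys = [y_transform(y) for _, y in points]
    return xs, ys


# Hypothetical recording: (time in seconds, amplitude).
recording = [(0.0, 1.0), (1.0, 10.0), (2.0, 100.0)]

# Same small function, two graphs: raw amplitude and log-scaled amplitude.
linear_xs, linear_ys = build_series(recording)
log_xs, log_ys = build_series(recording, y_transform=math.log10)
```

Keeping the data preparation separate from the drawing code also means the transformation can be checked on its own, without rendering a figure.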

cool-RR, almost 14 years ago

My approach is to have the scientist write as little code as possible. That's why I'm working on GarlicSim: http://garlicsim.org

GarlicSim's goal is to do all the technical, tedious work involved in writing a simulation while letting the scientist write only the code that's relevant to his field of study.

salva_xf, almost 14 years ago

It could be that they are not using the correct language. If they had some domain-specific language on top of Common Lisp, for example, they would have much better code with less work, I think.

JonnieCache, almost 14 years ago

Maybe we can do an outreach program? Hackers adopting scientists?

sc68cal, almost 14 years ago

This is why partnering is key. I partnered up with a geneticist who understands his subject matter, while I can focus on my subject matter.

The end result was a grant funded by NIH.

jostmey, almost 14 years ago

Incorporate the ability to attach units to values, and require their use! Problem partially solved :-)
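
As a toy illustration of what requiring units could look like, the sketch below tags each number with a unit name and refuses to add mismatched quantities. The `Quantity` class, the `FACTORS` table, and the example values are all invented for this example; it is not a full dimensional-analysis system and does not model any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number tagged with a unit name (toy example, single units only)."""
    value: float
    unit: str

    def __add__(self, other):
        if not isinstance(other, Quantity):
            return NotImplemented
        if self.unit != other.unit:
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)

    def to(self, unit, factors):
        """Convert using an explicit table of (from_unit, to_unit) -> multiplier."""
        if unit == self.unit:
            return self
        try:
            factor = factors[(self.unit, unit)]
        except KeyError:
            raise TypeError(f"no conversion from {self.unit} to {unit}") from None
        return Quantity(self.value * factor, unit)


# Conversion table for this example only.
FACTORS = {("cm", "m"): 0.01, ("m", "cm"): 100.0}

depth = Quantity(2.0, "m")
offset = Quantity(50.0, "cm")

# Mixing units must be done explicitly; depth + offset would raise TypeError.
total = depth.to("cm", FACTORS) + offset
```

Here `total` is `Quantity(250.0, "cm")`, while the unconverted `depth + offset` fails immediately instead of silently producing 52.0 of nothing in particular.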

gte910h, almost 14 years ago

Is there a git client for the unwilling? I could see that solving some of the issues.

tedjdziuba, almost 14 years ago

I know of a company, made up of scientists from academia, that develops software by writing the code (or "codes" as they call it) in Microsoft Word documents and e-mailing them to each other.

Somehow, they are still in business.

True story.
