Well, I guess I'll pat myself on the back for always using git. Having seen a lab's major university research project literally evaporate due to lack of proper version control, I've been keen on it ever since. Besides, I like the incremental feel of accomplishment when I make the next commit.<p>I agree that setting up a dedicated pipeline early on in a project and committing yourself to work within its confines aids organization, but it also contributes to putting your creativity in the right place. We often enjoy building things for the sake of building, and those of us with stronger proclivities towards engineering can sometimes get too jolly putting together new pipes when we should be training models. I have been guilty in the past of writing a Perl script that spawned shell scripts that cued Perl scripts on a cluster that ran R scripts that piped back into Perl scripts. Then again, I always liked the board game Mouse Trap.
Cool article, thanks. I used to spend 90% of my downtime trying to improve my stats and machine learning knowledge, but in the last couple of years I've come to realize how much my lack of proper engineering was hurting me.<p>A (flawed) analogy: if data science is storytelling, stats is the story and engineering is the words you use to tell it. You need to do well at both to effectively tell your story.<p>>...good engineering going out the window quickly with elaborate ensembling.<p>This is one of my criticisms of data mining contests (sorry, Kaggle!). When I was in grad school I liked doing these contests to get practical experience - my last company actually recruited me through one. But as they got more popular, I found the best engineers and data scientists stopped having as much of a chance of winning. Good modelers get 90% of the way there and then are beaten by impractical solutions that would get someone fired from a job.
It appears our blog is down. Oof. Apologies for the inconvenience. While we're figuring it out, here is a mirror of the post: <a href="http://blog.untrod.com/2012/10/engineering-practices-in-data-science.html" rel="nofollow">http://blog.untrod.com/2012/10/engineering-practices-in-data...</a><p>Edit: And we're back up.
I love definitions that wax my ego! I grok revision control, and can sling enough R / stats to officially make me a nerd of an urban planner. Does this mean I can start calling myself an "urban data scientist"? I definitely need a pay raise.