Hah! tectonic and I applied to YC with almost exactly this in 2009!<p>We went as far as building a browser-based, IDE-like environment for generating these, and a language called parsley for expressing the scrapes. If you're interested in this, you could check out some of our related open source libraries:<p><a href="http://selectorgadget.com" rel="nofollow">http://selectorgadget.com</a><p><a href="https://github.com/fizx/parsley" rel="nofollow">https://github.com/fizx/parsley</a><p><a href="https://github.com/fizx/parsley-ruby" rel="nofollow">https://github.com/fizx/parsley-ruby</a><p><a href="https://github.com/fizx/pyparsley" rel="nofollow">https://github.com/fizx/pyparsley</a><p><a href="https://github.com/fizx/csvget" rel="nofollow">https://github.com/fizx/csvget</a><p>Edit: I just open-sourced the scraping wiki site we created here: <a href="https://github.com/fizx/parselets_com" rel="nofollow">https://github.com/fizx/parselets_com</a><p><pre><code> > cat hn.let
{
  "headlines": [{
    "title": ".title a",
    "link": ".title a @href",
    "comments": "match(.subtext a:nth-child(3), '\\d+')",
    "user": ".subtext a:nth-child(2)",
    "score": "match(.subtext span, '\\d+')",
    "time": "match(.subtext, '\\d+\\s+\\w+\\s+ago')"
  }]
}
> csvget --directory-prefix=./data -A "/x" -w 5 --parselet=hn.let http://news.ycombinator.com/
> head data/headlines.csv
comments,title,time,link,score,user
4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,http://en.wikipedia.org/wiki/Simpson%27s_paradox,41,waldrews
67,America's unjust sex laws,2 hours ago,http://www.economist.com/opinion/displaystory.cfm?story_id=14165460,59,MikeCapone
23,Buy somebody lunch,3 hours ago,http://www.whattofix.com/blog/archives/2009/08/buy-somebody-lu.php,58,DanielBMarkham</code></pre>
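<p>Once csvget has written data/headlines.csv, any CSV reader can consume it. Here is a rough Python sketch using only the standard library, assuming the column layout shown in the output above. The sample rows are inlined so it runs without the scrape, and the helper names (load_headlines, top_by_score) are made up for illustration, not part of parsley or csvget.

```python
import csv
import io

# Sample rows copied from the csvget output above, inlined so the sketch
# runs standalone. In practice you would open("data/headlines.csv") instead.
SAMPLE = """comments,title,time,link,score,user
4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,http://en.wikipedia.org/wiki/Simpson%27s_paradox,41,waldrews
67,America's unjust sex laws,2 hours ago,http://www.economist.com/opinion/displaystory.cfm?story_id=14165460,59,MikeCapone
23,Buy somebody lunch,3 hours ago,http://www.whattofix.com/blog/archives/2009/08/buy-somebody-lu.php,58,DanielBMarkham
"""


def load_headlines(f):
    """Read headline rows from a file-like object, converting counts to int."""
    rows = []
    for row in csv.DictReader(f):
        # The parselet extracts these with '\d+', so they should be plain digits.
        row["score"] = int(row["score"])
        row["comments"] = int(row["comments"])
        rows.append(row)
    return rows


def top_by_score(rows, n=3):
    """Return the n highest-scoring rows, best first (hypothetical helper)."""
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:n]


if __name__ == "__main__":
    rows = load_headlines(io.StringIO(SAMPLE))
    for r in top_by_score(rows):
        print(r["score"], r["title"], r["user"])
```

The int() conversions will raise if a row is missing a score or comment count (job postings, for instance), so a real script would probably want to skip or default those rows.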