Ask HN: What's the best set of tools to structure crawled web pages?

2 points by lucasrp almost 12 years ago

Hello everybody,

I have to scrape ~1k news sources (among other types of content) on the web and extract data such as title, author, date, news body, etc.

Right now we use horrible in-house code (and Jsoup) to parse it. The problem is that we rely on regular expressions and CSS selectors to do it. As you can imagine, the maintenance cost is very high: every time a source changes its template, we have to redo the rules by hand.

We are interested in redoing the whole thing from scratch, and I would like to know which tools, or set of tools, would support a more intelligent approach. I've had a nice experience building a date parser with ANTLR, for example.

Any suggestions?
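A minimal sketch of the per-source rule approach described above, to make the maintenance problem concrete. Everything in it is an assumption for illustration: the source name `example-news`, the class names, the sample HTML, and the helper `extract` are all hypothetical, and it uses only Python's standard library rather than Jsoup so that it stays self-contained.

```python
import re

# Hypothetical per-source rules: one regex per field, per news source.
# With ~1k sources, this table would need ~1k hand-maintained entries.
RULES = {
    "example-news": {
        "title": re.compile(r'<h1 class="headline">(.*?)</h1>', re.S),
        "author": re.compile(r'<span class="byline">(.*?)</span>', re.S),
        "date": re.compile(r'<time datetime="(.*?)"', re.S),
    },
}


def extract(source, html):
    """Apply the rules for one source; a field whose pattern no longer
    matches comes back as None instead of raising."""
    fields = {}
    for name, pattern in RULES[source].items():
        match = pattern.search(html)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Made-up page in the template the rules were written for.
OLD_TEMPLATE = (
    '<h1 class="headline">Budget passes</h1>'
    '<span class="byline">A. Writer</span>'
    '<time datetime="2013-05-01">May 1</time>'
)

# Same content after a redesign that renames a single CSS class.
NEW_TEMPLATE = OLD_TEMPLATE.replace('class="headline"', 'class="article-title"')

print(extract("example-news", OLD_TEMPLATE))
print(extract("example-news", NEW_TEMPLATE))
```

In this sketch, renaming one class is enough to make the title come back as `None` while the other fields still parse, which is the kind of silent breakage that has to be found and fixed by hand for each source.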

1 comment

palidanx almost 12 years ago

I use the Mechanize gem for Rails: http://mechanize.rubyforge.org/