Ask HN: What's the best set of tools to structure crawled web pages?

2 points by lucasrp almost 12 years ago
Hello everybody,

I have to scrape ~1k news sources (among other types of content) on the web and extract data such as title, author, date, news body, etc.

Right now we use horrible in-house code (and Jsoup) to parse it. The problem is that we rely on regular expressions and CSS selectors to do it. As you can imagine, the maintenance cost is very high, because every time a source changes its template, we have to redo the work by hand.

We are interested in redoing the whole thing from scratch, and I would like to know which tools, or set of tools, would suit a more intelligent approach. I've had a nice experience with ANTLR building a date parser, for example.

Any suggestions?
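To make the template-dependence problem concrete, here is a minimal sketch of one less brittle starting point: reading title, author, and date from page-level `<meta>` tags (Open Graph and similar) before falling back to per-source selectors. It assumes the sources actually emit such tags, which many news sites do but not all. The names `META_KEYS`, `MetaExtractor`, and `extract_fields` are hypothetical, and the key lists are illustrative rather than exhaustive. It is not a replacement for body extraction, only for the metadata fields.

```python
from html.parser import HTMLParser

# Hypothetical mapping: output field -> candidate <meta> keys, in priority order.
# These key lists are an assumption; adjust them to what your sources emit.
META_KEYS = {
    "title": ["og:title", "twitter:title"],
    "author": ["author", "article:author"],
    "date": ["article:published_time", "date"],
}


class MetaExtractor(HTMLParser):
    """Collects <meta> name/property -> content pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph uses property="..."; older conventions use name="...".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first occurrence of each key.
                self.meta.setdefault(key.lower(), content.strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_fields(html):
    """Return a dict with title, author, and date; missing fields are None."""
    parser = MetaExtractor()
    parser.feed(html)
    fields = {}
    for field, keys in META_KEYS.items():
        fields[field] = next(
            (parser.meta[k] for k in keys if k in parser.meta), None
        )
    # Fall back to the <title> element when no title meta tag is present.
    if fields["title"] is None and parser.title_text.strip():
        fields["title"] = parser.title_text.strip()
    return fields
```

Fields that come back as `None` are the ones that would still need a per-source rule, so a sketch like this mainly shrinks the set of hand-maintained selectors rather than removing it.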

1 comment

palidanx almost 12 years ago
I use the Mechanize gem for Rails:

http://mechanize.rubyforge.org/