Ask HN: A bit of direction from people who might know. Crawling/screen scraping

1 point by inovica almost 12 years ago
I have been working on a tool to 'train' a crawler to extract specific elements of a page, and I could do with some advice on where to take it from here. Here's how it currently works:

1) It has a queue of domains that I have pre-processed. For now I've restricted it to pages that I think are ecommerce, based on $ signs, add to cart/basket type links, etc.

2) There is a visual tool that I then use to select certain parts of the page, e.g. price, product, image. I save these out as XPaths.

3) Once I have done one URL, I send a crawler to that domain, extract other pages that fit the profile of an ecommerce page, and try to use the same mapping as in step 2 to extract the data.

I have made a short video to show it in action:

http://www.screencast.com/t/riB3iiVMiSk

I'm not sure if I'm doing this the right way. If a site or page changes structure then I may have to re-map the data. I was hoping that someone would have some pointers on other ways to do this. I've also had some problems with JavaScript-heavy sites.

If anyone has any knowledge of screen scraping, and of where it can be done more automatically, I'd really appreciate a steer!

Thanks

Ade
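
For readers following along, the three steps in the post could be sketched roughly as below. This is only an illustration under stated assumptions, not the poster's actual tool: the `MAPPING` dictionary, the XPath expressions, the sample page and the helper names are all hypothetical. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup, so real-world HTML would likely need a more lenient parser. The `missing` list is one possible way to notice that a site has changed structure and needs re-mapping.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping saved from the visual tool: field name -> XPath.
# ElementTree only supports a limited XPath subset, so the paths are kept simple.
MAPPING = {
    "product": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
    "image": ".//img[@id='main']",
}

# Assumed heuristics for step 1: a currency amount and a cart/basket phrase.
PRICE_RE = re.compile(r"[$£€]\s?\d+(?:[.,]\d{2})?")
CART_RE = re.compile(r"add to (?:cart|basket)", re.IGNORECASE)


def looks_like_ecommerce(page_text):
    """Crude profile check: a currency amount plus an add-to-cart/basket phrase."""
    return bool(PRICE_RE.search(page_text)) and bool(CART_RE.search(page_text))


def extract(root, mapping):
    """Apply each saved XPath and return (data, missing_fields)."""
    data, missing = {}, []
    for field, xpath in mapping.items():
        node = root.find(xpath)
        if node is None:
            missing.append(field)
            continue
        # Images carry their value in the src attribute, other fields in text.
        value = node.get("src") if node.tag == "img" else (node.text or "").strip()
        if value:
            data[field] = value
        else:
            missing.append(field)
    return data, missing


# A made-up, well-formed sample page standing in for a crawled product page.
PAGE = """<html><body>
<h1 class="title">Blue Widget</h1>
<span class="price">$19.99</span>
<img id="main" src="/img/widget.jpg"/>
<a href="/cart">Add to basket</a>
</body></html>"""

root = ET.fromstring(PAGE)
page_text = " ".join(root.itertext())
is_shop = looks_like_ecommerce(page_text)
data, missing = extract(root, MAPPING)
```

If `missing` is non-empty for many pages on a domain where the mapping used to work, that is a reasonable signal that the layout has changed and the domain should go back into the queue for re-mapping.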

no comments
