
Who needs to scrape millions of pages, or monitor them?

5 points by calufa over 11 years ago
Hi,

For the last year I have been working on an easy-to-use web scraper called Tales. Tales is written in Java and uses HTTP APIs to start scraping. It has an HTML dashboard where you can watch, in real time, things like memory, CPU, pages scraped per second, errors, server health, and other dev-friendly goodies.

Tales gives you an out-of-the-box way to scrape HTML pages and put them into S3 (e.g. ...:8080/start?process=tales.scrapers.LoopScraper -template tales.templates.DynamicDataDownloader -threads 2 -namespace com_twitter -baseURL twitter.com), but you can also extend it with custom scraping logic. A custom scraper could, for example, extract titles, ratings, images, and blobs, and store them in MySQL using the simple Tales Java APIs.

Tales is built from several interesting services. Among them:

- GitSync: keeps the code on the server up to date; all you need to do is push from your local machine.
- DirListener: among other important things, it recompiles the services every time it sees a change.
- ServerMonitor: keeps track of server health.
- S3DBBackup, S3DBRestore: backs up and restores databases -- you may run out of space, or want to move.

Tales can run as many threads as you like, uses little memory and CPU, and can run for days. It can run on many servers at the same time, with all the databases located in one place or distributed across the servers, all manageable via the Java APIs or the config file.

Tales can also fail over to another server when it gets blocked. The failover logic goes through a Java interface, so you can write custom IP-pooling logic.

Tales has scraped tens of millions of pages across many domains.

* Source: https://github.com/calufa/tales-core
* The documentation is old; I will update it soon.

I am currently working with big data -- Solr, OpenNLP, all that sugar -- and I needed data from custom sources without running 10 shells to get it done.

calufa@gmail.com
linkedin.com/in/calufa
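[Editor's note] To make the HTTP start API above concrete, here is a minimal sketch of triggering a scrape from Java. The endpoint and parameters are taken verbatim from the example in the post; the host name (scraper-host) is a placeholder assumption, and this is not the actual client shipped with Tales:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class StartScrape {
    public static void main(String[] args) throws Exception {
        // The process arguments exactly as in the post's example,
        // URL-encoded so the spaces survive the query string.
        String process = URLEncoder.encode(
                "tales.scrapers.LoopScraper"
                + " -template tales.templates.DynamicDataDownloader"
                + " -threads 2 -namespace com_twitter -baseURL twitter.com",
                StandardCharsets.UTF_8);

        // "scraper-host" stands in for whatever server runs Tales.
        URI start = URI.create("http://scraper-host:8080/start?process=" + process);

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(start).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The failover hook the post mentions might look something like the following; these names are illustrative assumptions, not the real Tales interface:

```java
// Hypothetical shape of the IP-pooling failover hook. The real Tales
// interface may differ; this only illustrates the idea of handing back
// a replacement server or proxy IP once the current one is blocked.
public interface FailoverPolicy {
    // Return the next server (or proxy IP) to use after `blocked`
    // has started getting refused by the target site.
    String next(String blocked);
}
```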

2 comments

webvet over 11 years ago
Cool!! Is it robots.txt compliant? If not, it might be a good idea to make this available as an option/parameter.

For 'quick and dirty' tasks, wget -r can come in handy too.
volokoumphetico over 11 years ago
Very cool. Is it doing a depth-first blind crawl of any domain you throw at it?