
Show HN: robots.txt as a service, check web crawl rules through an API

17 points by fooock almost 6 years ago

4 comments

dscpls almost 6 years ago
Why a service and not a library?

It looks like a great way for you to discover URLs, but like a terribly slow way for people to avoid implementing robots.txt rules.
tehwhale almost 6 years ago
While this looks good, I don't think it's feasible for a web crawler in most cases. Crawlers want to crawl a ton of URLs and it would have to make a request to your service for each and every URL.

What's the plan here? Check for a sitemap.xml (which generally only contains crawlable URLs anyway), or crawl the index, look for all links, and send a request to your service for every URL before crawling it?

I personally think it would be better suited as a library where you can pass it a robots.txt and it'll let you know if you can crawl a URL based on that.
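As an illustration of the library approach dscpls and tehwhale describe, here is a minimal sketch using Python's standard-library urllib.robotparser; the robots.txt rules, URLs, and crawler name are made-up placeholders, not anything from the project itself.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content that a crawler would normally have
# fetched once from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Answer per-URL questions locally -- no request to an external
# service is needed once the rules are parsed.
for url in ("https://example.com/", "https://example.com/private/page"):
    verdict = "allowed" if parser.can_fetch("MyCrawler/1.0", url) else "disallowed"
    print(verdict, url)
```

Everything happens in-process, so checking thousands of URLs per host costs no extra network round trips.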
itsmefaz almost 6 years ago
The service is very nice and I understand your reason for developing it. I see this service having more value in helping companies find all the web pages, rather than just the allowed ones.

I understand the unethical nature of the above method; however, I see it happening quite a lot in practice.
fooock almost 6 years ago
I created this project to use in my own projects. It is open source. You can use it if you are implementing SEO tools or a web crawler. Note that this is a first alpha release.

Give me some feedback!
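For contrast, a sketch of what the per-URL, over-the-network check discussed in this thread might look like from a crawler's side; the endpoint, query parameters, and response shape are hypothetical and not the project's actual API.

```python
import requests

# Hypothetical service endpoint and parameters -- illustrative only,
# not the real API of the project being shown here.
SERVICE_URL = "https://robots-checker.example.com/api/check"

def can_crawl(url: str, user_agent: str = "MyCrawler/1.0") -> bool:
    """Ask the remote service whether `url` may be crawled."""
    resp = requests.get(
        SERVICE_URL,
        params={"url": url, "userAgent": user_agent},
        timeout=5,
    )
    resp.raise_for_status()
    # Assume the service answers with JSON such as {"allowed": true}.
    return bool(resp.json().get("allowed", False))

if __name__ == "__main__":
    print(can_crawl("https://example.com/private/page"))
```

Each check is one HTTP round trip, which is the overhead tehwhale's comment raises for crawls that touch a large number of URLs.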