
Show HN: robots.txt as a service, check web crawl rules through an API

17 points by fooock almost 6 years ago

4 comments

dscpls almost 6 years ago
Why a service and not a library?

It looks like a great way for you to discover URLs, but like a terribly slow way for people to avoid implementing robots.txt rules.
tehwhale almost 6 years ago
While this looks good, I don't think it's feasible for a web crawler in most cases. Crawlers want to crawl a ton of URLs and it would have to make a request to your service for each and every URL.

What's the plan here? Check for a sitemap.xml (which generally only contains crawlable URLs anyway), or crawl the index, look for all links, and send a request to your service for every URL before crawling it?

I personally think it would be better suited as a library where you can pass it a robots.txt and it'll let you know if you can crawl a URL based on that.
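As an illustration of the library approach dscpls and tehwhale describe, here is a minimal sketch using Python's standard-library urllib.robotparser; the robots.txt rules, URLs, and crawler name are made-up placeholders, not anything from the project itself.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content that a crawler would normally have
# fetched once from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Answer per-URL questions locally -- no request to an external
# service is needed once the rules are parsed.
for url in ("https://example.com/", "https://example.com/private/page"):
    verdict = "allowed" if parser.can_fetch("MyCrawler/1.0", url) else "disallowed"
    print(verdict, url)
```

Everything happens in-process, so checking thousands of URLs per host costs no extra network round trips.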
itsmefaz almost 6 years ago
The service is very nice and I understand your reason for developing it. I see this service having more value in helping companies find all the web pages, rather than just the allowed ones.

I understand the unethical nature of the above method; however, I see it happening quite a lot in practice.
fooock almost 6 years ago
I created this project to use in my own projects. It is open source. You can use it if you are implementing SEO tools or a web crawler. Note that this is a first alpha release.

Give me some feedback!
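For contrast, a sketch of what the per-URL, over-the-network check discussed in this thread might look like from a crawler's side; the endpoint, query parameters, and response shape are hypothetical and not the project's actual API.

```python
import requests

# Hypothetical service endpoint and parameters -- illustrative only,
# not the real API of the project being shown here.
SERVICE_URL = "https://robots-checker.example.com/api/check"

def can_crawl(url: str, user_agent: str = "MyCrawler/1.0") -> bool:
    """Ask the remote service whether `url` may be crawled."""
    resp = requests.get(
        SERVICE_URL,
        params={"url": url, "userAgent": user_agent},
        timeout=5,
    )
    resp.raise_for_status()
    # Assume the service answers with JSON such as {"allowed": true}.
    return bool(resp.json().get("allowed", False))

if __name__ == "__main__":
    print(can_crawl("https://example.com/private/page"))
```

Each check is one HTTP round trip, which is the overhead tehwhale's comment raises for crawls that touch a large number of URLs.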