TechEcho

Ask HN: I want to crawl every plain HTML website. Where do I begin?

4 points by dwrodri over 2 years ago

Web crawling in this day and age is loaded with caveats, and must be done carefully so as not to put unnecessary load on other people's infrastructure. I'd really like to try to make my own web search tool, so I'm trying to scope out something simple enough to get me started. Here's what I have so far.

- I don't want to parse anything in the SimilarWeb Top 50.
- I don't want to render JS.
- I'd like to keep a web index that is still measured in TBs.

I've done search engines for research papers in the past. The key difference was that I could collect that data very easily through a documented API. Now I need to either build or use a crawler and I'm not sure where to begin. Here are some thoughts that I have so far.

- I'm probably going to write the crawler in Go. It seems like a good fit for this sort of software.
- How do I start collecting lists of domains? Do I just start hitting public IPv4 addresses on ports 80 and 443?
- If I run something like this with proper rate limiting on a server, would Cloudflare inevitably just start blocking me?
- If I were to run this from a machine connected to the Internet via a residential ISP, would I get a nasty letter from my ISP?

Any advice or feedback is appreciated. The goal of this project is to learn more about web crawling, more so than to build a product that would be sold.

3 comments

ratio11 over 2 years ago
You might be interested in Common Crawl. They crawl the internet and make the full dataset downloadable.

https://commoncrawl.org
heresjohnny over 2 years ago
More and more sites are driven by lazily loaded content, though, for which JavaScript is a prerequisite. Do note that you're excluding a significant number of sites this way.
deepsy over 2 years ago
I'd probably look into AWS Lambda, as you can query in parallel from different IPs cheaply and at scale.