TechEcho

Web crawling in this day and age is loaded with caveats, and must be done carefully so as not to put an unnecessary load on other people's infrastructure. I'd really like to try to make my own web search tool, so I'm trying to scope out something simple enough to get me started. Here's what I have so far.

- I don't want to parse anything in the SimilarWeb Top 50.
- I don't want to render JS.
- I'd like to keep a web index that is still measured in TBs.

I've done search engines for research papers in the past. The key difference was that I could collect that data very easily through a documented API. Now I need to either build or use a crawler and I'm not sure where to begin. Here are some thoughts that I have so far.

- I'm probably going to write the crawler in Go. It seems like a good fit for this sort of software.
- How do I start collecting lists of domains? Do I just start hitting public IPv4 addresses on ports 80 and 443?
- If I run something like this with proper rate limiting on a server, would Cloudflare inevitably just start blocking me?
- If I were to run this from a machine connected to the Internet via a residential ISP, would I get a nasty letter from my ISP?

Any advice or feedback is appreciated. The goal of this project is to learn more about web crawling rather than to build a product to sell.

3 comments

ratio11 over 2 years ago

You might be interested in Common Crawl. They crawl the internet and make the full dataset downloadable: https://commoncrawl.org

heresjohnny over 2 years ago

More and more sites are driven by lazily loaded content, though, for which JavaScript is a prerequisite. Do note that you’re excluding a significant number of sites this way.

deepsy over 2 years ago

I'd probably look into AWS Lambda, as you can query in parallel from different IPs cheaply at scale.

Ask HN: I want to crawl every plain HTML website. Where do I begin?
