First of all, it’s hidden services, not the dark web.<p>Second, to anyone crawling hidden services or crawling over Tor, please run a relay or reduce your hop count. Don’t sacrifice others’ desperate need for anonymity for your $whatever_purpose_thats_probably_not_important. It might be a fun project for you, but some people rely on Tor for free, secure, and anonymous Internet access.
Disclaimer: I have rather little experience with Go and only skimmed the crawler code.<p>From what I could see, the author made an effort to make the crawler distributed with k8s (which I don't think is needed, considering there are only approximately 75,000 onion addresses) using modern buzzword technology, but the crawler itself is rather simplistic. It doesn't even seem to index/crawl relative URLs, just absolute ones.
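Resolving relative links is a one-liner in Go's standard library, which makes the omission surprising. A minimal sketch (hypothetical URLs, not taken from the crawler's code):<p><pre><code>package main

import (
	"fmt"
	"log"
	"net/url"
)

func main() {
	// Page the link was found on (placeholder onion address).
	base, err := url.Parse("http://exampleonion.onion/dir/index.html")
	if err != nil {
		log.Fatal(err)
	}
	// Relative href extracted from that page.
	ref, err := url.Parse("../about.html")
	if err != nil {
		log.Fatal(err)
	}
	// ResolveReference yields an absolute URL that can be enqueued
	// like any other: http://exampleonion.onion/about.html
	fmt.Println(base.ResolveReference(ref).String())
}
</code></pre>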
To anyone experimenting with such stuff, <i>take care</i> and don't make your services publicly available. The dark web in particular is full of highly illegal content such as child pornography, and in some jurisdictions even "involuntary possession", such as files in browser caches, may be enough to convict you.
I’ve been pretty surprised at how big hidden services have become.<p>Dread, the dark net Reddit, is surprisingly vibrant.<p>I think it’s weird that people almost don't <i>want</i> to hear positive stories about the dark net.<p>It’ll be funny when news articles and romcoms just start “forgetting” to qualify their plot piece with the “it’s scary” trope.
Crawlers are fun!<p>If you're new to the field and want something that's easy to set up & polite, I strongly recommend Apache Storm Crawler (<a href="https://github.com/DigitalPebble/storm-crawler" rel="nofollow">https://github.com/DigitalPebble/storm-crawler</a>).
A well-written article with a lot of technical details. Well done.<p>However, I'm wondering what a good practical purpose of crawling the dark web would be.
I did the same in Racket when I made a Tor search engine. Here's the source code of the crawler!<p><a href="https://github.com/torgle/torgle/blob/master/backend/torgle.rkt" rel="nofollow">https://github.com/torgle/torgle/blob/master/backend/torgle....</a>
Any HTTP-aware software that supports SOCKS proxies can access information on hidden services, so any crawler can do it. I fail to see what is novel about that, except that it uses k8s and mongo and a catchy blog title.
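For illustration, here's a minimal sketch of pointing Go's standard HTTP client at Tor's default SOCKS5 port. It assumes a local Tor daemon listening on 127.0.0.1:9050, and the onion address is a placeholder:<p><pre><code>package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Tor's default SOCKS5 listener.
	proxyURL, err := url.Parse("socks5://127.0.0.1:9050")
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
	}

	// Placeholder onion address; the hostname is passed to the proxy,
	// so Tor handles the .onion resolution.
	resp, err := client.Get("http://exampleonion.onion/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status, len(body))
}
</code></pre>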
How well does it handle a gzip bomb? <a href="https://www.hackerfactor.com/blog/index.php?/archives/762-Attacked-Over-Tor.html" rel="nofollow">https://www.hackerfactor.com/blog/index.php?/archives/762-At...</a>
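One common mitigation (a hedged sketch in Go, not what the posted crawler necessarily does) is to cap how many decompressed bytes you're willing to read per response, so a bomb can waste some bandwidth but not your memory or disk:<p><pre><code>package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// fetchLimited reads at most maxBytes of the (transparently decompressed)
// response body, regardless of what Content-Length or the archive claims.
func fetchLimited(client *http.Client, url string, maxBytes int64) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// The default transport decompresses gzip automatically; LimitReader
	// caps the decompressed size we actually buffer.
	return io.ReadAll(io.LimitReader(resp.Body, maxBytes))
}

func main() {
	body, err := fetchLimited(http.DefaultClient, "http://example.com/", 1<<20) // 1 MiB cap
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(body))
}
</code></pre>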
Go is a horrible language in which to write a crawler. The main problem is that NLP and machine-learning libraries simply aren't as prevalent or robust as they are in Java and Python.