A couple of months ago I processed all of the Common Crawl project's metadata to extract every indexed domain name. That was about 10 TB of metadata and yielded 26 million domain names. The EC2 cost for processing it was only about $10. If anyone is interested, let me know.<p>edit: available as torrent here: <a href="https://all-certificates.s3.amazonaws.com/domainnames.gz?torrent" rel="nofollow">https://all-certificates.s3.amazonaws.com/domainnames.gz?tor...</a>
A warning about parsing zone files... the grammar is deceptively tricky.<p>While TLD registries will <i>probably</i> provide you with files in a sane subset[0] of the format specified in RFC 1035, there are a number of things that will <i>NOT</i> work in general:<p>- Splitting the file into lines (paren-blocks and quoted strings can span lines, strings can contain ';', etc.)<p>- Splitting the file on whitespace (it's significant in column 1 and inside strings)<p>- Applying a regex (you'll need lookahead for conditional matching and it'll get ugly fast)<p>Don't go down the road of assuming it's a simple delimited file.<p>A few references:<p><a href="https://www.nlnetlabs.nl/projects/nsd/documentation.html" rel="nofollow">https://www.nlnetlabs.nl/projects/nsd/documentation.html</a><p><a href="http://www.verycomputer.com/96_5ad11cc47053d8b0_1.htm" rel="nofollow">http://www.verycomputer.com/96_5ad11cc47053d8b0_1.htm</a><p>[0] See page 9 of <a href="https://archive.icann.org/en/topics/new-gtlds/zfa-strategy-paper-12may10-en.pdf" rel="nofollow">https://archive.icann.org/en/topics/new-gtlds/zfa-strategy-p...</a>
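To make the pitfalls above concrete, here is a minimal sketch of a character-level tokenizer that groups a zone file into logical records. It is an illustration under assumptions, not a full RFC 1035 parser: the name `zone_records` and the sample data are made up, backslash escapes are kept verbatim rather than decoded, and directives such as $INCLUDE are treated as ordinary tokens. It does handle the three traps listed: parentheses and quoted strings that span lines, ';' inside strings, and significant whitespace in column 1.

```python
def zone_records(text):
    """Return a list of (leading_blank, tokens) tuples, one per logical record.

    leading_blank is True when the record starts with whitespace in column 1,
    which in a zone file means "reuse the previous owner name".
    """
    records = []
    tokens = []
    buf = None            # characters of the token being built; None = no open token
    depth = 0             # parenthesis nesting level
    in_quote = False
    leading_blank = False
    record_start = True   # the next character is column 1 of a new record

    def flush():
        nonlocal buf
        if buf is not None:
            tokens.append("".join(buf))
            buf = None

    i = 0
    while i < len(text):
        ch = text[i]
        if record_start:
            leading_blank = ch in " \t"
            record_start = False
        if ch == "\\" and i + 1 < len(text):
            # Keep the escape sequence verbatim (no \DDD decoding in this sketch).
            buf = (buf or []) + [ch, text[i + 1]]
            i += 2
            continue
        if in_quote:
            if ch == '"':
                in_quote = False
            else:
                buf.append(ch)        # newlines and ';' are literal inside quotes
        elif ch == '"':
            in_quote = True
            if buf is None:
                buf = []              # lets an empty string "" become a token
        elif ch == ";":
            flush()
            while i < len(text) and text[i] != "\n":
                i += 1                # skip the comment, leave the newline
            continue
        elif ch == "(":
            flush()
            depth += 1
        elif ch == ")":
            flush()
            depth = max(depth - 1, 0)
        elif ch == "\n":
            flush()
            if depth == 0:            # a newline only ends the record outside parens
                if tokens:
                    records.append((leading_blank, tokens))
                tokens = []
                record_start = True
        elif ch in " \t\r":
            flush()
        else:
            if buf is None:
                buf = []
            buf.append(ch)
        i += 1

    flush()
    if tokens:
        records.append((leading_blank, tokens))
    return records


# Made-up sample exercising each trap.
SAMPLE = '''$ORIGIN example.com.
@   IN SOA ns1 hostmaster (
        2024010101 ; serial
        3600 900 604800 300 )
    IN NS ns1
txt IN TXT "v=spf1; not a comment" "line one
line two"
'''

for leading_blank, record_tokens in zone_records(SAMPLE):
    print(leading_blank, record_tokens)
```

For anything beyond experimentation, an existing parser (for example the one in NSD, linked above) is the safer choice; this only shows why line splitting and whitespace splitting are not enough.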
I had been downloading the zone file for .PK domains on a daily basis until they blocked zone transfers. By comparing these daily zone files I was able to publish statistics [1] and also broke the news about hacked .PK domains [2], which was picked up by all leading tech blogs and news agencies.<p>Currently, I cannot find a way to get the zone file, even by making an official request to the registry manager.<p>[1]: <a href="https://www.i.com.pk/pknic-domain-registration-statistics/" rel="nofollow">https://www.i.com.pk/pknic-domain-registration-statistics/</a><p>[2]: <a href="https://www.i.com.pk/110-pk-domains-managed-by-markmonitor-got-hacked-by-turkish-hackers/" rel="nofollow">https://www.i.com.pk/110-pk-domains-managed-by-markmonitor-g...</a>
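For anyone wanting to try the same kind of day-over-day comparison, a rough sketch follows. It assumes each snapshot has already been reduced to one domain name per line; the function names and sample data are hypothetical, and this is not necessarily how the statistics above were produced.

```python
def load_names(lines):
    """Normalize one-domain-per-line input into a set of lowercase names."""
    names = set()
    for line in lines:
        # Drop surrounding whitespace, case differences, and a trailing root dot.
        name = line.strip().lower().rstrip(".")
        if name:
            names.add(name)
    return names


def diff_snapshots(old_lines, new_lines):
    """Return (added, removed) domain names between two snapshots, sorted."""
    old = load_names(old_lines)
    new = load_names(new_lines)
    return sorted(new - old), sorted(old - new)


# Made-up data standing in for two daily snapshots.
yesterday = ["example.pk.", "foo.com.pk", "bar.pk"]
today = ["EXAMPLE.pk", "foo.com.pk", "baz.pk", ""]

added, removed = diff_snapshots(yesterday, today)
print("added:", added)
print("removed:", removed)
```

Sets keep this simple as long as both snapshots fit in memory; for very large zones, sorting both files and merging them line by line would be the usual alternative.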
What if someone were to maintain an unofficial list with one domain per line, freely available as a daily torrent or served directly? Would there be a rights problem with mirroring and filtering ICANN data?
FWIW, a TLD zone file does not contain every registered domain name, just those with DNS records. There are typically a fair number of domain names that are registered but have no records, for reasons such as reserved names, malicious content takedowns, etc.
Interesting, thanks for the pointer.<p>I've wondered about this before, since I run my own blacklists for $work's mail servers: how I could slightly "penalize" brand-new domain names, correlate "spammy" domains with certain nameservers, and so on.