There is a comment that was voted dead but actually answered the question.
These services have their own crawlers. If you ever spot, for example, MajesticBot in your access logs, you have found one of the biggest.
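If you want to check your own server, a minimal sketch of scanning an access log for known crawler user agents might look like the following. This assumes a combined log format where the user agent is the last double-quoted field; the log path and the list of crawler names are illustrative, not exhaustive.

    import re
    from collections import Counter

    # Hypothetical path; point this at your web server's actual access log.
    LOG_PATH = "/var/log/nginx/access.log"

    # A few well-known crawler user-agent substrings (illustrative only).
    CRAWLER_NAMES = ["MajesticBot", "MJ12bot", "AhrefsBot", "SemrushBot", "Googlebot"]

    # In the combined log format, the user agent is the last quoted field.
    UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

    hits = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = UA_PATTERN.search(line)
            if not match:
                continue
            user_agent = match.group(1).lower()
            for name in CRAWLER_NAMES:
                if name.lower() in user_agent:
                    hits[name] += 1

    for name, count in hits.most_common():
        print(f"{name}: {count} requests")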
They use distributed web crawlers to crawl hundreds of billions of web pages. Probably one of the following options (a toy sketch of option 1 follows this list):

1) They built their own crawlers.

2) They run an Apache Nutch/Heritrix cluster in a colo facility.

3) They use third-party services like mixnode.
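To give a sense of what "built their own crawlers" means at the smallest possible scale, here is a single-process sketch of a crawl frontier using only the Python standard library. The seed URL is a placeholder; a real distributed crawler would add robots.txt handling, politeness delays, deduplication at scale, and shard the frontier across many worker machines.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import Request, urlopen

    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=20):
        frontier = deque([seed])  # URLs waiting to be fetched
        seen = {seed}             # dedupe so each URL is fetched once
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                req = Request(url, headers={"User-Agent": "toy-crawler/0.1"})
                with urlopen(req, timeout=10) as resp:
                    if "html" not in resp.headers.get("Content-Type", ""):
                        continue
                    body = resp.read().decode("utf-8", errors="replace")
            except Exception as exc:
                print(f"skip {url}: {exc}")
                continue
            fetched += 1
            print(f"fetched {url} ({len(body)} bytes)")
            # Extract links and push unseen http(s) URLs onto the frontier.
            parser = LinkExtractor()
            parser.feed(body)
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)

    if __name__ == "__main__":
        crawl("https://example.com/")  # placeholder seed URL

The whole trick at production scale is that the frontier and the seen-set become distributed data structures shared across thousands of fetcher processes, which is essentially what Nutch/Heritrix clusters or services like mixnode manage for you.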