Recently I've been working on a project that required me to search a limited set of domains.
1) I tried Google Custom Search, but usage is capped at 10,000 API calls per day at $50/day. I cannot go higher than that even if I wanted to pay more.
2) So I decided to write a crawler that visits these domains and indexes their pages into Elasticsearch, so I can build a basic search engine on top (a rough sketch of what I mean is at the end of this post).
3) While building the crawler I came across domains that do not allow their content to be crawled; Yelp, Craigslist, etc. have locked their sites down against crawlers. However, I still see results from these sites on Google.
4) It seems the only option is to use their APIs (if they provide any) to get data from these domains. That becomes a nightmare to maintain as the number of domains grows.
5) I want to respect these domains' policies and not use shady tactics to crawl their pages.

So essentially these domains allow Google and Bing to crawl their sites because they are big and established, and not having their pages show up on Google or Bing would drastically hurt their web traffic, while smaller startups are left out in the cold.

So my question is: what is the likelihood of a new search engine emerging if the web is locked down to crawlers?
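For concreteness, here is a minimal sketch of what I mean by 2) and 5), in Python with requests, BeautifulSoup, and a local Elasticsearch node reached over its REST API. The domain list, crawler name, index name, and field names are placeholders; the point is simply that robots.txt is consulted before anything is fetched.

    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    DOMAINS = ["example.com"]        # placeholder list of the domains I care about
    USER_AGENT = "my-niche-crawler"  # placeholder crawler name
    ES_URL = "http://localhost:9200/pages/_doc"  # assumes a local Elasticsearch node

    def allowed_by_robots(url):
        # Respect the site's robots.txt before fetching anything (point 5).
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(root, "/robots.txt"))
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def index_page(url):
        # Fetch one page and store its title and text in Elasticsearch (point 2).
        if not allowed_by_robots(url):
            return  # the site has locked this path down, so skip it
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        doc = {
            "url": url,
            "title": soup.title.string if soup.title else "",
            "text": soup.get_text(" ", strip=True),
        }
        requests.post(ES_URL, json=doc, timeout=10)

    for domain in DOMAINS:
        index_page("http://" + domain + "/")

Following links within each domain and batching the indexing would come next, but even this much shows where sites like Yelp shut the door: the robots.txt check fails and the page is skipped.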
Have you tried blekko.com? The slashtags feature does specifically what you are describing here.

Of course, using this method will not get you data from sites that explicitly ban crawling (like yelp.com).

Disclaimer: I used to work at blekko and I implemented the slashtags feature.
This is slightly off-topic, but still very much related.

There will never be an all-encompassing search engine to rival Google. I would argue one could not even rival Bing.

The future of search, as with most internet companies moving forward, is all about mastering a niche. You may not be able to beat Google as an overall search engine, but if you dedicate your time and money to one specific niche, say sports, then you have a fighting chance of beating them at that.
If the service is exposed publicly on the web, it can be crawled regardless of whatever guards the service provider has put in place. Browser emulation is a good start; a rough sketch is below.
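For instance, something along these lines with Selenium drives a real browser, so pages that only render via JavaScript still come back as full HTML. The URL and the wait are placeholders, and chromedriver is assumed to be installed:

    import time
    from selenium import webdriver

    # Drive a real browser so pages are fetched and rendered like a normal visit.
    driver = webdriver.Chrome()               # assumes chromedriver is on PATH
    driver.get("http://example.com/listing")  # placeholder URL
    time.sleep(2)                             # crude wait for client-side JS to finish
    html = driver.page_source                 # fully rendered HTML, ready to parse
    driver.quit()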
It's all about innovating and offering something that Google doesn't offer, such as privacy or smart widgets. See: DuckDuckGo.com

You're not going to crawl / index better than Google.

Maybe you could combine the DMOZ directory and Google results somehow.

Rather than replacing Google, try to supplement their services or just somehow provide more value to users.
If the cap on Google Custom Search is the only roadblock, then why not use Bing's search API:

http://datamarket.azure.com/dataset/8818F55E-2FE5-4CE3-A617-0B8BA8419F65
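From memory, it's an OData-style HTTPS endpoint authenticated with your account key as the HTTP Basic password, roughly along these lines; I haven't verified the exact endpoint path, parameter names, or result fields, so treat them as assumptions:

    import requests

    ACCOUNT_KEY = "..."  # your Azure DataMarket account key

    # Rough, unverified sketch of querying the Bing Search API on Azure DataMarket.
    resp = requests.get(
        "https://api.datamarket.azure.com/Bing/Search/Web",
        params={"Query": "'site:yelp.com pizza'", "$format": "json"},
        auth=("", ACCOUNT_KEY),  # account key goes in as the Basic-auth password
    )
    for result in resp.json()["d"]["results"]:
        print(result["Title"], result["Url"])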