One thing I hope this project does that Google fails to do is give developers a good API to access search. Google closed down their first web search API and now only give developers access to a limited Custom Search API that's rate limited to 100 queries a day for free with a hard limit of 10k searches - that makes it either very hard to develop anything against or relatively expensive. There are other options (Bing, Faroo, raw access to CommonCrawl) but they're either low quality or hard to work with. A good quality, straightforward, open web search API would be awesome.
I've tried using different search engines to Google numerous times, but each time I've returned to Google simply because the searches are better. They're more accurate, more relevant, and I very rarely find myself searching more than once to find something.<p>If commonsearch can beat Google in that regard, then count me in. But I doubt it will.
I think people might underestimate the power of an open source search engine. In my eyes it is like Wikipedia versus the old paper encyclopedias. Improvements to Google's search results are made by a relatively small number of people at Google. Google decides where you buy, what you think and how you live. Behind their algorithms they have probably made dozens of subjective choices. Public debate, more attention to detail, and open politics are, as I see it, great tools to improve search engine quality.
I like the project's goal, but as techies we inevitably want to understand the technical details and how they help (or handicap) the search results in comparison with Google.<p>For example, the project's data sources page[1] says that the bulk of the data comes from the Common Crawl. It looks like the CC is ~150 TB of data[2]. I'm not familiar with google.com internals, but various sources estimate that their proprietary crawl dataset is more than a petabyte. (A Googler could chime in here with more accurate data.)<p>So it's not as simple as the <i>algorithm</i> for Common Search being "more fair" than the algorithm for Google Inc. The underlying dataset, in terms of quantity, recency, rules for the robot, etc., affects what the algorithm can return.<p>This is not a criticism of the project. It is my attempt to understand what is not obvious at the surface level.<p>[1]<a href="https://about.commonsearch.org/data-sources" rel="nofollow">https://about.commonsearch.org/data-sources</a><p>[2]<a href="http://commoncrawl.org/2015/12/november-2015-crawl-archive-now-available/" rel="nofollow">http://commoncrawl.org/2015/12/november-2015-crawl-archive-n...</a><p>(I can't tell if each MM/YYYY archive is cumulative or an addendum.)
Seeing how the founder is the same person who founded Jamendo, which was later turned into a sad, user-unfriendly attempt to make money with freely licensed music (destroying its community in the process), how can I trust commonsearch not to be a waste of time and attention?
I'm trying to find out from their website, but it's unclear. Are the servers hosted in the USA? And will the organisation be incorporated in the USA?<p>If you're talking about privacy and transparency, it's better to operate in a place bound by the European Charter of Fundamental Rights, rather than the US Constitution, because the former gives people <i>many</i> more rights over their data, how it's used, etc.
I like it!<p>The explainer tool gives a really cool insight into the results: <a href="https://explain.commonsearch.org/" rel="nofollow">https://explain.commonsearch.org/</a>
Neat, I was working on a project to give a full programmatic keyword index to the contents of the common crawl, but I guess there's no need! It's very exciting to consider what kind of applications you can build with this.
I'd love to see a Wikipedia-style search where people can improve or flag results as they see fit. I wonder if that has been tried.<p>Sure, it might not handle the long, long tail, but the top ten million searches would still be pretty useful.
This sounds awesome. Speaking of building AIs/bots and such in your FAQ, the lack of a good open API for search is probably what gates that market to Google, Microsoft and the like: nobody else can just tap a search engine. I'd love to be able to connect to this for queries at some point.
"Nonprofit" is a bad smell to me: the problem is sustainability, which for nonprofits is all about the money and not about carbon or solar energy, rainbows, plutonium or any of that.