Isn't this what Common Crawl[1] is?<p>> What is Common Crawl?<p>> Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.<p>> What can you do with a copy of the web?<p>> The possibilities are endless, but people have used the data to improve language translation software, predict trends, track the disease propagation and much more.<p>> Can’t Google or Microsoft just do that?<p>> Our goal is to democratize the data so everyone, not just big companies, can do high quality research and analysis.<p>Also, DuckDuckGo founder Gabriel Weinberg expressed the sentiment many years ago that the index should be separate from the search engine:<p>> Our approach was to treat the “copy the Internet” part as the commodity. You could get it from multiple places. When I started, Google, Yahoo, Yandex and Microsoft were all building indexes. We focused on doing things the other guys couldn’t do. [2]<p>From what I remember reading, though, DuckDuckGo doesn't use Common Crawl.<p>[1] <a href="https://commoncrawl.org/" rel="nofollow">https://commoncrawl.org/</a><p>[2] <a href="https://www.japantimes.co.jp/news/2013/07/28/business/duckduckgo-chief-spills-on-search-engine-wars/" rel="nofollow">https://www.japantimes.co.jp/news/2013/07/28/business/duckdu...</a>
There are two entities trying to pull this off:<p>Common Crawl (non-profit): Stores regular, broad, monthly crawls as WARC files. Provides a separate index that can be used to look data up (not a fulltext index, though). Used mostly in academia.<p>Mixnode (for-profit): Regularly crawls the web and lets users write SQL queries against the data. Not sure who the primary users are since it's in private beta.<p>There are some search engine APIs, but I don't think the conflict of interest would allow for cost-effective large-scale access and pricing...
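To make the Common Crawl side concrete: that separate lookup index is queryable over plain HTTP. A minimal sketch in Python, assuming the public index API at index.commoncrawl.org and a crawl label (CC-MAIN-2019-09 here) that may need to be swapped for a more recent one:

```python
# Minimal sketch: look up captures of a URL in Common Crawl's lookup index.
# Assumes the public index API at index.commoncrawl.org; the crawl label
# below is an example and may need to be replaced with a current one.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2019-09-index"

resp = requests.get(INDEX, params={"url": "example.com/*", "output": "json"})
resp.raise_for_status()

# The API returns one JSON record per line, pointing into the WARC archives.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["url"], record["filename"], record["offset"], record["length"])
```

Each record points at an offset/length inside a WARC file, which is what you would then fetch to get the archived page content - so it's a locator, not a fulltext search.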
Could the Internet Archive, specifically <a href="https://web.archive.org/" rel="nofollow">https://web.archive.org/</a> be the basis of an Open Web Index as proposed by the author?<p>I'm sure there are tons of obstacles to that path, but it also would be far ahead of any new initiative in at least two ways: it already has a huge index and ingestion pipeline, and it is a trusted organization.
It seems like the idea is recommending the Open Web Index (which has its own website).<p>I like a modified version of this. I think that it should be a p2p technology and not try to create one meta-index, but rather be many domain-specific ones, with one or more tools or DBs to select which indices to search given a query/context.<p>Are there any decentralized alternatives to Google out there already?<p>I also think this overlaps with the idea of moving from a server-centric internet to a content-centric internet.
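On the "select which indices to search" part, here is a toy sketch of what that routing layer could look like; the index names, topic sets and scoring are all made up for illustration:

```python
# Hypothetical sketch: route a query to domain-specific indices based on
# simple keyword overlap. The index names and topic vocabularies below are
# invented; a real system would learn or curate these.
TOPIC_INDICES = {
    "cooking-index.example":  {"recipe", "baking", "sourdough"},
    "security-index.example": {"cve", "exploit", "tls"},
    "music-index.example":    {"chords", "vinyl", "synth"},
}

def pick_indices(query: str, max_indices: int = 2):
    terms = set(query.lower().split())
    # Score each index by how many query terms fall in its vocabulary.
    scored = [(len(terms & topics), name) for name, topics in TOPIC_INDICES.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    return [name for _, name in sorted(scored, reverse=True)[:max_indices]]

print(pick_indices("sourdough baking temperature"))
# -> ['cooking-index.example']
```

In practice the selection step would presumably be learned or curated rather than raw keyword overlap, but the shape of the problem - a thin router in front of many small indices - is the same.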
While I like the idea, I fear the potential for abuse, conflict and community splits. It will need some sort of moderation, at least to prevent:<p>1. spam<p>2. child pornography<p>3. content that is against the law<p>The only thing that is easy to define as policy is #2. No one likes child porn. But even then, there are grey areas with differing legal status - lolicon on the anime side and "barely legal" on the realistic side, plus CGI.<p>Spam: for me, I'd flag all commercial advertising as spam; others would hesitate to block even Viagra spammers.<p>Then the final category: illegal content. The US doesn't like nipples. Germany has no problem with nipples. Swastikas and other NS insignia? Other way around. Some post-Soviet states have banned the Hammer and Sickle or the Red Star. Some countries have extremely strict libel laws, others have non-existent libel laws. In some countries (hello Germany) even <i>linking</i> to illegal content can get you thrown into jail, in others not.<p>And finally: who should pay for the operational costs of such an index? Wikipedia only works out because contributors worldwide donate <i>enormous</i> amounts of time to it, and Wikipedia has only a fraction of the amount of content that Youtube and Twitter create, and Facebook is orders of magnitude bigger.
There are lots of niche directories out there - if you consider Reddit wikis, "awesome" lists and so on.<p>A few of us out there are also working on small directories:<p>* <a href="https://href.cool" rel="nofollow">https://href.cool</a> (mine)<p>* <a href="https://indieseek.xyz" rel="nofollow">https://indieseek.xyz</a><p>* <a href="https://iwebthings.com" rel="nofollow">https://iwebthings.com</a><p>The thought is that you can actually navigate a small directory - they don't need to be five levels deep - and a network of these would rival a huge directory while avoiding centralization, editor wars, and a single point of failure.
The web needs to be forked into two distinct standards: one for dynamic content, and one for documents. The first would use basically everything in the HTML5/CSS/JS toolbox, and the second would be more akin to AMP, but for all docs.<p>The benefits of this would be a standard for WYSIWYG editors (goodbye to the million rich text editor projects, Markdown, and even Microsoft Word), and more semantic markup for both search engines and accessibility.<p>Right now it takes millions of man hours to create a performant browser, which limits those engines to only the largest organizations. Even Microsoft gave up making their own. And even with all that effort, I still can't create a clean HTML document with an interface as rich as MS Word, or even add bold or color formatting to a Twitter post, or update a Wikipedia page without knowing wiki markup.<p>We need to pull the dynamic, JS-powered side of the web out from the core, limit CSS to non-dynamic properties, and standardize on an efficient in-document binary storage akin to MIME email attachments so HTML docs can be self-contained like a Word or PDF doc.<p>This document-centric web could be marked off within a standard web page, so you could embed it in regular interfaces for things like social network posts. Or it could stand on its own, allowing relatively large sites to be created with indexes, footnotes, etc., but served from a basic static server.<p>This isn't a technical challenge, it's an organizational one. I've thought for years that Mozilla should be doing this, instead of messing with IoT and phones, etc. It's such an obvious problem that needs addressing, and it would have a huge payback in terms of advancing the web as we know it.
I have been wanting this for years...<p>If you look at the original Yahoo page when Yahoo first started out, it attempted to solve this problem.<p>I believe this index could be regionally or language-based...<p>In the United States one could use:<p>Dewey Decimal<p><a href="https://en.wikipedia.org/wiki/Dewey_Decimal_Classification" rel="nofollow">https://en.wikipedia.org/wiki/Dewey_Decimal_Classification</a><p>Library of Congress<p><a href="https://en.wikipedia.org/wiki/Library_of_Congress_Classification" rel="nofollow">https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...</a>
I've always thought it would make more sense if each web server could be responsible for indexing the material that it serves (and offer notifications of updates), so instead of having to crawl everything yourself, you could just request the index from each domain, and then merge them.
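To make that concrete, here is a hedged sketch of the merging side, assuming each domain exposed something like a /site-index.json mapping terms to URLs - that endpoint and its format are invented here, no such standard exists:

```python
# Hypothetical sketch: merge per-domain indexes into one inverted index.
# The /site-index.json endpoint and its {"term": ["url", ...]} shape are
# assumptions made up for this example.
from collections import defaultdict
import requests

DOMAINS = ["example.com", "example.org"]

def fetch_site_index(domain):
    resp = requests.get(f"https://{domain}/site-index.json", timeout=10)
    resp.raise_for_status()
    return resp.json()  # e.g. {"sourdough": ["https://example.com/bread"], ...}

merged = defaultdict(set)
for domain in DOMAINS:
    for term, urls in fetch_site_index(domain).items():
        merged[term].update(urls)

print(sorted(merged.get("sourdough", [])))
```

The hard parts this skips are trust (a domain could claim any terms it likes, which is basically keyword-stuffing spam) and ranking across domains, which is where a crawler-side index still earns its keep.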
The PDF is a little short on details. It sounds like webmasters would all have to cooperate with allowing crawls from an "OWI" bot.<p>One of the challenges of creating a "web index" is first creating indexes of each website. "Crawling" to discover every page of a website, as well as all links to external sites, is labour-intensive and relatively inefficient. Part of that is because there is no 100% reliable way to know, before we begin accessing a website, each and every URL for each and every page of the site. There are inconsistent efforts such as "site index" pages or the "sitemap" protocol (introduced by Google), but we cannot rely on all websites to create a comprehensive list of pages and to share it.<p>However, I believe there is a way to generate such a list from something that almost all websites do create: logs.<p>When Google crawls a website, it is often or maybe even always the case that the site generates logs of every HTTP request that googlebot makes.<p>If a website were to share publicly, in some standardised format, the portion of their log where googlebot has most recently crawled the site, we might see a URL for each and every page of the site that Google has requested.<p>By automating this procedure of sharing listings of those googlebot HTTP requests, the public could generate a "site index" directly from the source, via the information on googlebot requests in the logs.<p>Allowing crawls from a "new" bot would not be necessary.<p>Webmasters know what URLs they offer to Google. Google knows as well. The public, however, does not.<p>It is a public web. Absent mistakes by webmasters, any pages that Google is allowed to crawl are intended to be public.<p>Why should the public not have access to a list of all the pages of websites that Google crawls?<p>I don't know, but there must be reasons I have failed to consider.<p>What are the reasons the public <i>should not know</i> what pages are publicly available via the web, except as made visible (or invisible) through a middleman like Google?<p>There are none.<p>Being able to see logs of all the googlebot requests would be one way to see what Google has in their index without actually accessing Google.
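Extracting that listing from an ordinary access log is close to trivial. A sketch, assuming a combined-log-format file with the user agent somewhere on each line - real log formats vary, and the googlebot check here is just a substring match, so treat it as illustrative:

```python
# Sketch: pull the set of URLs googlebot has requested out of an access log.
# Assumes one request per line with a quoted "GET /path HTTP/x.y" section;
# adjust the pattern for your server's actual log format.
import re

LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*"')

def googlebot_urls(log_path):
    urls = set()
    with open(log_path) as log:
        for line in log:
            if "Googlebot" not in line:
                continue
            match = LOG_LINE.search(line)
            if match:
                urls.add(match.group("path"))
    return sorted(urls)

# for url in googlebot_urls("access.log"):
#     print(url)
```

A real version would also want to verify the requests genuinely came from googlebot (e.g. via reverse DNS), since anyone can send that user-agent string.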
How far is this from the (now defunct) DMOZ?<p>It was a publicly maintained directory that I believe was at least theoretically independent of the larger web companies. It certainly had its share of drama, but it was a decent human-vetted index of what was out there...
As a user, if some other search engine can serve results that are better than Google's, I'd be happy to use it. I've tried DuckDuckGo; the results are disappointing and it often misinterpreted what I intended to search for. So I kept coming back to Google.<p>Will Google be willing to open its indexes? Probably not, since it's not in their best interest - it would help their competitors.
I assume that in a world of competing index users there is no one-size-fits-all. Presumably application design (and feature) choices will heavily influence how the index should work.<p>For a simple "I know TF-IDF, let's build a toy search engine" use case it will suffice, but beyond that?
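For reference, the toy-search-engine level really is tiny. A sketch of TF-IDF ranking over a handful of in-memory "pages" (the corpus here is obviously made up):

```python
# Toy TF-IDF ranking over a few in-memory "pages" -- roughly the level of
# search engine meant above, nothing more.
import math
from collections import Counter

DOCS = {
    "a": "open web index proposal",
    "b": "web crawling at scale",
    "c": "index structures for search",
}

def tf_idf_scores(query):
    tokenized = {doc_id: text.split() for doc_id, text in DOCS.items()}
    n_docs = len(tokenized)
    scores = Counter()
    for term in query.split():
        df = sum(1 for words in tokenized.values() if term in words)
        if df == 0:
            continue
        idf = math.log(n_docs / df)
        for doc_id, words in tokenized.items():
            tf = words.count(term) / len(words)
            scores[doc_id] += tf * idf
    return scores.most_common()

print(tf_idf_scores("web index"))
```

Anything beyond that - phrase queries, link-based ranking, freshness - starts putting real demands on how the shared index is structured, which is exactly where the one-size-fits-all assumption breaks down.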
With the advance of fiber networks, I think each browser/device will have its own web index. One problem is web sites that can only handle 1-3 simultaneous users - it would be an eternal hug of death from all the crawling.
The index itself is already separate, in the sense that nobody is being stopped from doing the indexing themselves.<p>Google is a private for-profit company, so we cannot realistically expect them to provide something for free to the public without generating profits in return.<p>The web itself is not a locked-up proprietary resource, so people can do the indexing themselves, but the real question is: how do you fund a service whose workload will keep increasing exponentially and indefinitely? What institution will have the resources to bear such costs?
Whole document here, from the arxiv.org page:<p><a href="https://arxiv.org/pdf/1903.03846" rel="nofollow">https://arxiv.org/pdf/1903.03846</a> [PDF]
I didn't see a mention of who would pay for this infrastructure. Is it considered a gov't-funded or volunteer/donation thing?<p>There also doesn't seem to be a mention of how to alleviate the tragedy-of-the-commons problem (unless I missed it). If Common Crawl is doing a fine job, who funds them?
Abstract:<p><i>A proposal for building an index of the Web that separates the infrastructure part of the search engine - the index - from the services part that will form the basis for myriad search engines and other services utilizing Web data on top of a public infrastructure open to everyone.</i>
I asked a Google engineer in a Google interview (at the end of it, when you get the chance to ask them questions) whether Google would ever make its infrastructure available to the public so they could leverage it in whatever way they wanted.<p>He had no idea what I was talking about.
Somebody could try to build their own crawler and feed it with the 260MM domain names dataset from <a href="https://domains-index.com" rel="nofollow">https://domains-index.com</a>
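A rough sketch of how seeding from such a list might work - the file name and one-domain-per-line format are assumptions, and robots.txt handling, politeness delays, deduplication and retries are all left out:

```python
# Sketch: fetch the front page of each domain from a seed list and collect
# outgoing links as new crawl candidates. Assumes a plain text file with one
# domain per line; a real crawler needs robots.txt checks and rate limiting.
import re
import requests

HREF = re.compile(r'href="(https?://[^"]+)"')

def crawl_seeds(seed_file, limit=10):
    discovered = set()
    with open(seed_file) as seeds:
        for domain in list(seeds)[:limit]:
            domain = domain.strip()
            if not domain:
                continue
            try:
                resp = requests.get(f"http://{domain}/", timeout=10)
            except requests.RequestException:
                continue
            discovered.update(HREF.findall(resp.text))
    return discovered

# print(len(crawl_seeds("domains.txt")))
```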
Ultimately, Wikipedia provides an effective keyword lookup that maps to curated links.<p>Regardless, the notion of a general web index is well-nigh moot at this point, because it was not built into the system from the get-go. Any such attempt now will be, by definition, ad hoc and built by some group of individuals; the vastness of the content, the cost of the project, and the intrinsic conflicts that will no doubt arise make independence from financial and legal pressures non-trivial, to say the least.<p>Really, Wikipedia is the most sensible foundation I can imagine, given that Google has become a self-serving for-profit corporate advertising machine.
I have argued that one regulatory outcome for Google could be the open release of their index - and even their database of "if you searched for X and clicked the top link, then came back five seconds later, we can infer the top link is not good for X".<p>And yes, I know that's pretty much all of Google. It's just that it's hard to get away from the idea that an index of web pages is anything other than the property of the people who created each web page and the links on it.<p>And it's not such a big leap to argue that data generated by my behaviour is actually my data (if it is likely to be personally identifying data - or perhaps a different term, like personally deanonymisable, is needed).<p>I do agree with the general direction of GDPR - but I honestly think the digital trail we leave is a different class of problem that needs different classes of legal concepts to work with.<p>I think digital data is a form of intellectual property that I create just by moving in the digital realm.<p>And if you have to pay me to use my data to sell me ads, you will likely stop.
Ironically it is EU regulations that make this idea totally impossible. One does not simply index documents, at least not for Europeans. You have to expurgate your index for the "right to be forgotten" people. You have to remove all the Nazi stuff because of Germans. This idea by a German is not possible because Europe.
This simply isn't needed, and if it is, it can be done by a charity or any group of people; it's not something that should be built into the infrastructure of the web itself.<p>You have to remember that the little the web prescribes is also its strongest attraction: it allows the web to be accessed and modified by anyone, and their own little bit of the web can be very different from someone else's.<p>So mandating a way the web must be indexed is kind of like moving closer to communism than liberalism. I guess if we start dictating to Google where to get their data then we've moved to the full-blown hammer and sickle stage :).