For an information system built on standards --- HTML as a document markup language, HTTP as a transfer protocol, TLS/SSL for security, TCP/IP as the underlying networking protocols, among others --- one that is conspicuously missing is an <i>indexing standard</i>.<p>That is, <i>even if a site wanted to</i>, there's no way for it to declare "I have content related to X". Better still would be if these indices could then be distributed in a cache-and-forward model, similar to how DNS (another distributed discovery index) works. There were some exceedingly rudimentary attempts at this through elements such as keyword meta tags, but even at their best these referenced a vanishingly small fraction of the actual content of a site or article. Sitemaps also address a component of the problem, but again only in part.<p>Some might see a few immediate issues. One is that not all sites are sufficiently dynamic to know what content they actually contain. To an extent this might be addressed through extensions to the webserver protocol such that a server would be aware, or <i>become</i> aware, of what content it contained.<p>Another is that a site might in some instances be inclined to misrepresent <i>what</i> it contained. This may be hard for some to believe, but I'm given to understand it occasionally does occur. To help guard against this, there might be <i>vetted</i> indices, in which one or more third parties <i>vouch</i> for the validity of an index. These reputation sources could of course themselves be assessed for accuracy.<p>But <i>if sites were responsible for reporting on what content they actually contained, and could be constrained to doing so accurately</i>, a huge part of the overhead of creating independent search engines, and of breaking the search-engine monopoly, would be eliminated.<p>One might imagine why certain existing gatekeepers over Web standards might oppose such an initiative.
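<p>To make the idea concrete, here is a minimal sketch (in Python) of what a self-published index <i>might</i> look like. Everything in it is invented for illustration --- the /.well-known/content-index.json location, the JSON schema, and the field names are not part of any existing standard --- but it shows the general shape: a site declares its documents, the terms it claims they cover, and a content hash that a third-party voucher could later attest to.<p><pre><code>
# Hypothetical sketch: the path and schema below are illustrative only,
# not an existing standard.
import json
import hashlib
from datetime import datetime, timezone

def build_index_entry(url, title, terms, body):
    """Describe one document: where it lives, what it claims to cover,
    and a content hash a third party could later vouch for."""
    return {
        "url": url,
        "title": title,
        "terms": sorted(set(terms)),  # topics the site claims for this document
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "updated": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }

def write_site_index(entries, path="content-index.json"):
    """Write the index a server might expose at, say,
    https://example.com/.well-known/content-index.json (an invented location)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"version": "0.1-draft", "entries": entries}, fh, indent=2)

if __name__ == "__main__":
    write_site_index([
        build_index_entry(
            "https://example.com/articles/indexing-standard",
            "Why the Web needs an indexing standard",
            ["web search", "indexing", "standards"],
            "Full article text would go here...",
        )
    ])
</code></pre><p>An aggregator could then fetch such files directly, rather than re-deriving the same information by crawling every page of every site.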
<p>There would still remain <i>other</i> problems to solve within the search space. It's possible to divide General Web Search into a set of specific problems:<p>- Site crawling: This includes determining crawl targets, any exclusions from such lists, and performing the actual crawling. Self-indexing addresses part of this problem.<p>- Indexing: Mapping actual contents to the keyword and query terms which might address that content.<p>- Ranking: Assigning a preference or deprecation to specific sites. This is essentially a trust / reputation assessment, combined with a canonicity / authenticity assessment (e.g., where did a specific item or document first appear?). A toy sketch of how indexing and ranking might compose follows this list.<p>- SEO: This is the Red Queen's race of addressing insincere / malicious actors. Strong and durable penalties for abuse, and long-term reputational accrual, should be useful here.<p>- Query interpretation: There's a considerable art to figuring out what a question actually means. In some cases queries should be taken strictly verbatim; quite often, however, interpretation is necessary. How alternative interpretations are posed might vary, with one option not often employed at present being to suggest a range of potential interpretations or related queries which might produce better results for a given query.<p>- Presentation: This is the generation of the search engine results page itself, incorporating several of the other considerations listed, but also addressing usability, accessibility, clarity, and other concerns.<p>- Revalidation: As the editors of the Hitchhiker's Guide observed, the Universe is not static, and circumstances change. Results and reputational assessments need to be revalidated, revisited, and revised over time.<p>- Monetisation/Funding: I'm partial to a public-goods model, or perhaps a farebox model collected via ISPs, pro-rated to general income/wealth within a region. Advertising, as a famous Stanford research paper prophetically observed, forces a misalignment with searchers' interests and objectives.
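<p>As a toy illustration of how the indexing and ranking steps might compose over self-published indices: the sketch below merges a few invented site indices into an inverted index, then scores results by a per-site reputation weight. The site names, reputation scores, and data structures are all made up, and real canonicity and trust assessment would be far more involved; the point is only how much of the machinery becomes simple once sites report their own content.<p><pre><code>
# Toy sketch: assumes sites publish indices shaped like the example above,
# and that an aggregator has assigned each site a reputation score in [0, 1].
from collections import defaultdict

def build_inverted_index(site_indices):
    """Map each claimed term to the (site, url) pairs claiming it (the indexing step)."""
    postings = defaultdict(list)
    for site, index in site_indices.items():
        for entry in index["entries"]:
            for term in entry["terms"]:
                postings[term.lower()].append((site, entry["url"]))
    return postings

def rank(postings, query_terms, reputation):
    """Score URLs by how many query terms they claim, weighted by how much
    we trust the site making the claim (the ranking step)."""
    scores = defaultdict(float)
    for term in query_terms:
        for site, url in postings.get(term.lower(), []):
            scores[url] += reputation.get(site, 0.1)  # unvouched sites get little weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    site_indices = {
        "example.com": {"entries": [
            {"url": "https://example.com/a", "terms": ["web search", "indexing"]},
        ]},
        "spam.example": {"entries": [
            {"url": "https://spam.example/buy", "terms": ["web search", "indexing"]},
        ]},
    }
    reputation = {"example.com": 0.9, "spam.example": 0.05}
    postings = build_inverted_index(site_indices)
    print(rank(postings, ["indexing", "web search"], reputation))
</code></pre><p>In a scheme like this, the strong and durable penalties for abuse mentioned under SEO would amount to driving a site's reputation weight toward zero, tying that item directly to the ranking step.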