Looking at https://explore2.marginalia.nu/search?domain=simonwillison.net — now that's an interesting service. The web has felt isolating since it became commercialized. Bloggers are living in the Google dark ages right now. Having information like this be readily accessible could help us find each other and get the band back together. The open web can be reborn.
BTW, if anyone wants to dabble in this problem space, I make, among other things, the entire link graph available here: https://downloads.marginalia.nu/exports/

(hold my beer as I DDOS my own website by offering multi-gigabyte downloads on the HN front page ;-)
In 2012 I was trying to turn my PhD thesis into a product: a better guitar tab and song lyrics search engine. The method was precisely this: use cosine similarity on the content itself (musical instructions parsed from the tabs, or the tokens of the lyrics).

This way I not only got much better search results than with PageRank, there was another benefit of this approach: you could cluster the results and pick each subsequent result from a distinct cluster. With Google you would not just get a bad result at number 1; results 1-20 would be near duplicates of just a few distinct efforts.

Unfortunately I was a terrible software engineer back then and had much to learn about making a product.
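A minimal sketch of the two ideas described above, assuming scikit-learn is available; the documents, query, and cluster count are made up for illustration and this is not the original system. It ranks by cosine similarity of content vectors, then diversifies the result list by taking the first hit from each cluster before falling back to the rest.

```python
# Sketch: content-based ranking plus cluster-based diversification.
# Documents, query, and parameters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "verse chorus G D Em C strumming pattern",
    "verse chorus G D Em C strum pattern acoustic",
    "fingerstyle arrangement drop D tuning tab",
    "lyrics only no chords just the words",
]
query = "G D Em C strumming"

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)
query_vec = vectorizer.transform([query])

# Rank every document by cosine similarity to the query.
scores = cosine_similarity(query_vec, doc_vecs).ravel()
ranked = sorted(range(len(docs)), key=lambda i: -scores[i])

# Cluster the documents, then surface one result per cluster first
# so near-duplicates don't fill the top of the list.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_vecs)

results, seen_clusters = [], set()
for i in ranked:
    if labels[i] not in seen_clusters:
        results.append(i)
        seen_clusters.add(labels[i])
# Once every cluster is represented, append the remaining documents in score order.
results += [i for i in ranked if i not in results]
print(results)
```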
The author describes calculating cosine similarity of high-dimensional vectors. If these are sparse binary vectors, why not just store a list of nonzero indexes instead? That way your “similarity” is just the length of the intersection of the two sets of indexes. Maybe I’m missing something.
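For sparse binary vectors the two views almost coincide: cosine similarity reduces to the intersection size divided by the geometric mean of the set sizes, so storing only the nonzero indices is indeed enough; the raw intersection count just drops the normalization. A small sketch (the index sets are hypothetical):

```python
import math

def cosine_binary(a_indices, b_indices):
    """Cosine similarity of two sparse binary vectors, given only their
    sets of nonzero indices: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a_indices or not b_indices:
        return 0.0
    overlap = len(a_indices & b_indices)
    return overlap / math.sqrt(len(a_indices) * len(b_indices))

# Illustrative index sets.
a = {2, 5, 7, 11}
b = {5, 7, 13}
print(cosine_binary(a, b))  # normalized cosine similarity
print(len(a & b))           # raw overlap, as suggested in the comment
```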
I love the random page: https://search.marginalia.nu/explore/random

It makes me feel like I'm on the old open web again.
I am surprised nobody has thought about looking at the page content itself to help fight spam. If a blog has nothing except paid affiliate links (Amazon, etc.), ads, and popups after page load (newsletter signups, etc.), then it should probably be downranked.

I have actually been developing something like that, but it does more, including downranking certain categories of sites that contain unnecessary filler, such as some recipe sites.
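A rough sketch of the kind of content-based downranking heuristic described here; the signals, thresholds, and weights are entirely made up and this is not any search engine's actual algorithm:

```python
# Hypothetical content-quality penalty; all numbers are illustrative.
from dataclasses import dataclass

@dataclass
class PageSignals:
    total_links: int
    affiliate_links: int        # e.g. Amazon tag links
    has_newsletter_popup: bool
    word_count: int
    filler_word_count: int      # e.g. boilerplate padding before a recipe

def quality_penalty(p: PageSignals) -> float:
    """Return a multiplier in (0, 1]; lower means the page ranks further down."""
    penalty = 1.0
    if p.total_links and p.affiliate_links / p.total_links > 0.5:
        penalty *= 0.5
    if p.has_newsletter_popup:
        penalty *= 0.8
    if p.word_count and p.filler_word_count / p.word_count > 0.6:
        penalty *= 0.6
    return penalty

print(quality_penalty(PageSignals(40, 30, True, 2000, 1500)))
```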
It gives plausible results for websites similar to HN:

https://explore2.marginalia.nu/search?domain=news.ycombinator.com
Aww, sadly nothing for my own websites!

This is such a great idea. Often when I find a small blog or site, I want more like it, and this is the perfect tool to discover that. It's a clear and straightforward idea in retrospect, as all really great ideas tend to be!
It concludes that a certain www.example.com is 42% similar to example.com, when they are exactly the same site: one redirects to the other. The only difference is the domain name, and even those character strings are more than 42% similar.
I have searched the GitHub repo for information about page ranking.

I am a newbie at SEO. I would greatly appreciate it if Marginalia provided a clean README about its ranking algorithm.

On the Marginalia search front page we have access to search keywords; the ranking algorithm is important enough to at least be discussed in layman's terms.

How do you optimize a page so that it gets a high ranking?

I understand this could be in the code documentation, but I have not yet checked it, sorry.
Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

This could be helpful in the short term, but I'm skeptical about the long term, as it will end up just as gamed.