Show HN: My related-posts finder script (with LLM and GPT4 enhancement)

69 points by tomhazledine over 1 year ago
I've open-sourced the script I use to find related blog posts (and to describe *why* they're similar).

Works on any set of markdown articles, so should fit into any SSG workflow.

Uses embeddings to calculate the similarities, and GPT4 to add descriptive text.
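The core idea is simple enough to sketch. This is a minimal illustration, not the OP's actual script: it assumes the OpenAI Python client, an arbitrary embedding model name, and a `posts` dict mapping slug to markdown text.

```python
# Sketch of the approach: embed each markdown post, then rank the other posts
# by cosine similarity. Not the OP's code; model name and data layout are
# assumptions for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

posts = {"post-a": "markdown text...", "post-b": "markdown text...", "post-c": "markdown text..."}
vectors = {slug: embed(text) for slug, text in posts.items()}

# For each post, list the most similar other posts, highest score first.
for slug, vec in vectors.items():
    scores = sorted(
        ((cosine(vec, other_vec), other_slug)
         for other_slug, other_vec in vectors.items() if other_slug != slug),
        reverse=True,
    )
    print(slug, "->", scores[:2])
```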

6 comments

iDon over 1 year ago
Extending that idea to the web, or at least to the blogosphere and information / knowledge web-sites, seems useful. I wonder if there is a web service which has calculated vector embeddings for some of the web, and supports vector search, e.g. given a URL, find URLs with similar embeddings. Inverting that, web-sites could annotate their web pages with embeddings via json-ld, which search engines could utilise. Both these ideas might be impractical, e.g. the cost of an HTTP GET of the vector might be similar to the cost of calculating the embedding, and the embedding would only be comparable with embeddings from the same model (which would be recorded in the json-ld), so it would age quickly. It would also be subject to SEO gaming, like meta tags.

A quick search didn't find either of these; the closest was this paper, which used json-ld to record a vector reduced to 2 dimensions using t-SNE: https://hajirajabeen.github.io/publications/Metadata_for_Emebdding.pdf ("Metadata standards for the FAIR sharing of vector embeddings in Biomedicine", Şenay Kafkas et al.)
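For illustration only, such an annotation might look roughly like this; the `embedding` and `embeddingModel` property names are invented here and are not part of any existing schema, and a real vector would have hundreds or thousands of dimensions:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "url": "https://example.com/some-post",
  "embeddingModel": "text-embedding-3-small",
  "embedding": [0.0132, -0.0871, 0.0415, 0.0023]
}
```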
TOMDM over 1 year ago
Really neat, thank you for posting!

How did you end up handling the case where posts are longer than the context window for the embedding API?
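One common workaround, not necessarily what the OP's script does, is to split the text into chunks that fit the model's context window, embed each chunk, and average the chunk vectors. A rough sketch, using crude character-based chunking in place of real token counting:

```python
# Chunk-and-average workaround for long posts (an assumption, not the OP's
# method): embed each chunk, average the vectors, re-normalize the result.
import numpy as np

def embed_long(text: str, embed, max_chars: int = 8000) -> np.ndarray:
    # `embed` is any function mapping a string to a vector, e.g. the helper
    # from the sketch above.
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    vectors = [embed(chunk) for chunk in chunks]
    mean = np.mean(vectors, axis=0)
    return mean / np.linalg.norm(mean)  # keep it unit-length for cosine math
```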
Hitton over 1 year ago
Cool, but the end result (the related articles in that post) is not great. The suggested articles have barely anything to do with the topic. I think there should be a minimal threshold for similarity (probably found empirically), and if no article reaches it, just don't show any.
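The suggested filter is a one-liner; the cutoff value below is made up and would have to be tuned empirically per embedding model:

```python
# Only surface related posts above a minimum similarity; show nothing otherwise.
# The 0.80 threshold is a placeholder, not a recommended value.
MIN_SIMILARITY = 0.80

def related(scores, limit=2, threshold=MIN_SIMILARITY):
    # `scores` is a list of (similarity, slug) pairs, highest first.
    hits = [(s, slug) for s, slug in scores if s >= threshold]
    return hits[:limit]  # may legitimately be empty
```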
pabe over 1 year ago
Thanks for contributing your work as OSS and writing a comprehensive blog post about it :)

I like the simplicity of your approach without a vector DB etc. In case you want to add one, Typesense seems to be a good OSS fit.
cj over 1 year ago
FYI, from OpenAI:

> Most models that support the legacy Completions endpoint will be shut off on January 4th, 2024.
gwern over 1 year ago
I do something similar with the OA API embeddings on my static website (https://www.gwern.net; crummy code at https://github.com/gwern/gwern.net/) for the 'similar' feature: call the OA API for an embedding, nearest-neighbor via cosine, list of links for suggested further reading.

Because it's a static site, managing the similar links poses the difficulties OP mentions: where do you store & update it? In the raw original Markdown? We solve it by transclusion: the list of 'similar' links is stored in a separate HTML snippet, which is just transcluded into the web page on demand. The snippets can be arbitrarily updated without affecting the Markdown essay source. We do this for other things too; it's a handy design pattern for static sites, to make things more compositional (allowing one HTML snippet to be reused in arbitrarily many places, or allowing 'extremely large' pages) at the cost of some client-side work doing the transclusion.

I refine it in a couple ways: I don't need to call GPT-4 for summarization because the links all have abstracts/excerpts; I usually write abstracts for my own essays/posts (which everyone should do, and if the summaries are good enough to embed, why not just use them yourself for your posts? It would also help your cache & cost issues, and be more useful than the 'explanation'). Then I also throw in the table of contents (which is implicitly an abstract), available metadata like tags & authors, and I further throw into the embeddings a list of the parsed links as well as *reverse citations/backlinks*. My assumption is that these improve the embedding by explicitly listing the URLs/titles of references, and what other pages find a given thing worth linking.

Parsing the links means I can improve the list of suggestions by deleting anything already linked in the article. OP has so few posts this may not be a problem for him, but if you are heavily hyperlinking and also have good embeddings (like I do), this will happen a lot, and it is annoying to a reader to be suggested links he has already seen and either looked at or ignored. This also means that it's easy to provide a curated 'see also' list: simply dump the similar list at the beginning, and keep the ones you like. They will be filtered out of the generated list automatically, so you can present known-good ones upfront, and then the similars provide a regularly updated list of more. (Which helps handle the tension he notes between making a static list up front while new links regularly enter the system.)

One neat thing you can do with a list of hits, that I haven't seen anyone else do, is change how you *sort them by distance*. The default presentation everyone does is to simply present them in order of distance to the target. This is sorta sensible because you at least see the 'closest' first, but the more links you have, the smaller the differences are, and the more that sorting looks completely arbitrary. What you can do instead is sort them by their distance *to each other*: if you do that, even in a simple greedy way, you get a list which automatically clusters by the internal topics. (Imagine there are two 'clusters' of topics equidistant to the current article; the default distance sort would give you something random-looking like A/B/B/A/B/A/A/A/B/B/A, which is painful to read, but if you sort by distance to each other to minimize the total distance, you'd get something more like B/B/B/B/B/B/A/A/A/A/A/A.) I call this 'sort by magic' or 'sort by semantic similarity': https://gwern.net/design#future-tag-features You can then, of course, take a cluster and have GPT-4 write a label describing it.

Additional notes: I would not present 'Similarity score: 79% match', because I assume this is just the cosine distance, which is equal for both suggestions (and therefore not helpful) and is also completely embedding-dependent and basically arbitrary. (A good heuristic is: would it mean anything to the reader if the number were smaller, larger, or had one less digit? A 'similarity score' of 89%, or 7.9, or 70%, would all mean the same thing to the reader - nothing.)

> Complex or not, calculating cosine similarity is a lot less work than creating a fully-fledged search algorithm, and the results will be of similar quality. In fact, I'd be willing to bet that the embedding-based search would win a head-to-head comparison most of the time.

You are probably wrong. The full search algorithm, using exact word-count indexes of everything, is highly competitive with embedding search. If you are interested, the baseline you're looking for in research papers on retrieval is 'BM25'.

> For each post, the script then finds the top two most-similar posts based on the cosine similarity of the embedding vectors.

Why only top two? It's at the bottom of the page, you're hardly hurting for space.
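A greedy version of that reordering is only a few lines. This is a sketch of the idea as described in the comment, not gwern's actual code:

```python
# Greedy 'sort by magic' sketch: start from the closest hit, then repeatedly
# append whichever remaining hit is nearest to the last one chosen, so items
# from the same topic cluster end up adjacent in the list.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sort_by_magic(hits):
    # `hits` is a list of (slug, vector) pairs, already filtered to the top-N
    # nearest neighbors of the target article and ordered closest-first.
    remaining = list(hits)
    ordered = [remaining.pop(0)]  # seed with the closest hit
    while remaining:
        last_vec = ordered[-1][1]
        nxt = max(remaining, key=lambda h: cosine(last_vec, h[1]))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered
```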