Long Read: Lessons from Building Semantic Search for GitHub and Why I Failed

146 points by zxt_tzx 2 months ago

12 comments

It's somewhat ironic that the author advocates for keeping it simple and using pgvector but then buries a ton of complexity with an API server, auth server, Cloudflare workers, and durable objects. Especially given

> Supabase easily the most expensive part of my stack (at $200/month, if we ran in it XL, i.e. the lowest tier with 4-core CPU)

That could get you a pretty decent VPS and allow you to coassemble everything with less complexity. This is exemplified in some of the gotchas, like

> Cloudflare Workers demand an entirely different pattern, even compared to other serverless runtimes like Lambda

If I'm hacking something together, learning an entirely different pattern for some third-party service is the last thing I want to do.

All that being said though, maybe all it would've done is prolong the inevitable death due to the product gap the author concludes with.


zxt_tzx 2 months ago

Author here. Over the last few months, I have built and launched a free semantic search tool for GitHub called SemHub (https://semhub.dev/). In this blog post, I share what I’ve learned and why I’ve failed, so that other builders can learn from my experience. This blog post runs long and I have sign-posted each section. I have marked the sections that I consider particularly insightful with an asterisk (*).

I have also summarized my key lessons here:

1. Default to pgvector, avoid premature optimization.
2. You probably can get away with shorter embeddings if you’re using Matryoshka embedding models.
3. Filtering with vector search may be harder than you expect.
4. If you love full stack TypeScript and use AWS, you’ll love SST. One day, I hope I can recommend Cloudflare in equally strong terms too.
5. Building is only half the battle. You have to solve a big enough problem and meet your users where they’re at.
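For readers unfamiliar with lesson 2, here is a minimal sketch of what "shorter embeddings" usually means in practice. It assumes the embedding model was trained with Matryoshka representation learning, so that the leading dimensions carry most of the signal; the function names are illustrative and are not taken from SemHub's code.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components, then rescale to unit length.

    Re-normalizing matters because cosine / inner-product search
    generally assumes unit-length vectors.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 3-dimensional "embedding" shortened to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Whether the shortened vectors retain enough retrieval quality depends on the model and the data, so it is worth measuring recall on your own queries before committing to a dimension.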


whakim 2 months ago

I was the first employee at a company which uses RAG (Halcyon), and I’ve been working through issues with various vector store providers for almost two years now. We’ve gone from tens of thousands to billions of embeddings in that timeframe - so I feel qualified to at least offer my opinion on the problem.

I agree that starting with pgvector is wise. It’s the thing you already have (postgres), and it works pretty well out of the box. But there are definitely gotchas that don’t usually get mentioned. Although the pgvector filtering story is better than it was a year ago, high-cardinality filters still feel like a bit of an afterthought (low-cardinality filters can be solved with partial indices even at scale). You should also be aware that the workload for ANN is pretty different from normal web-app stuff, so you probably want your embeddings in a separate, differently-optimized database. And if you do lots of updates or deletes, you’ll need to make sure autovacuum is properly tuned or else index performance will suffer. Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.

Dedicated vector stores often solve some of these problems but create others. Index builds are often much faster, and you’re working at a higher level (for better or worse) so there’s less time spent on tuning indices or database configurations. But (as mentioned in other comments) keeping your data in sync is a huge issue. Even if updates and deletes aren’t a big part of your workload, figuring out what metadata to index alongside your vectors can be challenging. Adding new pieces of metadata may involve rebuilding the entire index, so you need a robust way to move terabytes of data reasonably quickly. The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.
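One common reason filtered vector search is hard to reason about is the gap between filtering after the nearest-neighbour search and filtering before it. The toy sketch below uses exact brute-force search over made-up rows (the `repo` field, the data, and the function names are all hypothetical); it is not how any particular vector store works internally, only an illustration of why a post-filter can return fewer than k results.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query, rows, k):
    """Exact nearest neighbours by cosine similarity (brute force)."""
    ranked = sorted(rows, key=lambda r: cosine(query, r["vec"]), reverse=True)
    return ranked[:k]


def post_filter(query, rows, k, pred):
    """Search first, filter afterwards: may return fewer than k rows."""
    return [r for r in top_k(query, rows, k) if pred(r)]


def pre_filter(query, rows, k, pred):
    """Filter first, then search only the matching rows."""
    return top_k(query, [r for r in rows if pred(r)], k)


# Hypothetical data: the two closest rows belong to repo "a".
rows = [
    {"id": 1, "repo": "a", "vec": [1.0, 0.0]},
    {"id": 2, "repo": "a", "vec": [0.9, 0.1]},
    {"id": 3, "repo": "b", "vec": [0.8, 0.2]},
    {"id": 4, "repo": "b", "vec": [0.0, 1.0]},
]
query = [1.0, 0.0]


def in_repo_b(row):
    return row["repo"] == "b"


post_ids = [r["id"] for r in post_filter(query, rows, 2, in_repo_b)]
pre_ids = [r["id"] for r in pre_filter(query, rows, 2, in_repo_b)]
```

Here the post-filter returns nothing for repo "b" even though two matching rows exist, while the pre-filter finds both. With an approximate index the trade-off is less clear-cut, because pre-filtering on a selective predicate may not be able to use the index efficiently.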


johnfn 2 months ago

That was a great write up.

If you don't mind me giving you some unsolicited product feedback: I think SemHub didn't do well because it's unclear what problem it's actually solving. Who actually wants your product? What's the use case? I use GitHub issues all the time, and I can't think of a reason I'd want semhub. If I need to find a particular issue on, say, TypeScript, I'll just google "github typescript issue [description]" and pull up the correct thing 9 times out of 10. And that's already a pretty rare percentage of the time I spend on GitHub.


nchmy 2 months ago

This seems pretty similar to something that the ManticoreSearch team released a year ago: https://manticoresearch.com/blog/manticoresearch-github-issue-search-demo/

You can index any GH repo and then search it with vector, keyword, hybrid and more. There's faceting and anything else you could ever want. And it is astoundingly fast - even vector search.

Here's the direct link to the demo: https://github.manticoresearch.com/


VirgilShelton 2 months ago

Hey Warren, great job on the site, but what you'll need to do is SEO. You're a great writer, so all you need to add to your writing skills is SEO. I did a basic SEO audit of semhub.dev and you have no SEO. While this is niche, you'll need to add a blog to your website and use basic SEO keyword research to find what your target audience is searching for instead of just blogging to blog. Start reading https://backlinko.com/seo-basics-for-beginners and you'll be well on your way. It should take about a year for you to get some good traction. Don't rush, just keep learning more and more every day and you'll get there in a few years with organic SEO alone. The comments here alone are proof that you have a viable MVP.

GL!

serjester 2 months ago

Great write up, especially agree on pgvector with small (ideally fine tuned) embeddings. There’s so much complexity that comes with keeping your vector db in sync with your main db (especially once you start filtering with metadata). 90% of gen AI apps don’t need it.


scottyeager 2 months ago

> * No way to search across multiple repos within GitHub.
> * No way to easily see open and closed issues in the same view.

I don't quite understand, because searching issues across all of Github and also within orgs is already supported. Those searches show both open and closed issues by default.

For searches on a single repo, just removing the "state" filter entirely from the query also shows open and closed issues.

I do think that semantic search on issues is a cool idea and the semantic/fuzzy aspect is probably the biggest motivator for the project. It just felt funny to see stuff that Github can actually already do listed at the top of motivating issues.


brian-armstrong 2 months ago

Am I misunderstanding what is meant by semantic code search? I thought the idea was that you run something like a parser on the repo to extract function/class/variable names and then allow searching on a more rich set of data, rather than tokenizing it like English.

I know github kind of added this but their version falls apart still even in common languages like C++. It's not unusual for it to just completely miss cross references, even in smaller repos. A proper compiler's eye view of symbolic data would be super useful, and Github's halfway attempt can be frustratingly daft about it.


franky47 2 months ago

I started a quick weekend project to do just that today: index my OSS project's [1] issues & discussions, so I can RAG-ask it to find references when I feel like I'm repeating myself (in "see issue/PR/discussion #123", finding the 123 is the hardest part).

This article might be super helpful, thanks! I don't intend to make a product out of it though, so I can cut a lot of corners, like using a PAT for auth and running everything locally.

[1] https://github.com/47ng/nuqs


nosefrog 2 months ago

> When using Cloudflare Workers as an API server, I have experienced requests that would “fail silently” and leave a “hanging connection”, with no error thrown, no log emitted, and a frontend that is just loading. Honestly, no idea what’s up with this.

Yikes, these sorts of errors are so hard to debug. Especially if you don't have a real server to log into to get pcaps.


gregorvand 2 months ago

Hi Warren, great article. Would love to connect on what we're doing (also in Singapore). Please drop me a message gregor@vand.hk