Hi HN,<p>I spent a few weeks looking at the top HN posts of all time. This included exploration, clustering, creating visualizations, and zooming in on what (to me personally) seem like some of the best discussions on here.<p>Three things in this post:<p>1- The interesting groups of HN posts<p>2- The interactive visualizations that you can explore in your browser<p>3- The data from this exploration -- this includes a CSV of the titles as well as the text embeddings of 3,000 Ask HN articles.<p>Blog post about this whole process here: [1]<p>============<p>1- The interesting groups of HN posts<p>From the exploration, Ask HN proved the most interesting. These are the top four groups of topics I found insightful. Each group contains about 400 posts.<p>- Life experiences and advice threads [2]<p>- Technical and personal development [3]<p>- Software career insights, advice, and discussions [4]<p>- General content recommendations (blogs/podcasts) [5]<p>============<p>2- The interactive visualizations that you can explore in your browser<p>- Top 10,000 Hacker News articles of all time [6]<p>- Top 3,000 posts in Ask HN [7]<p>============<p>3- The data from this exploration<p>CSV file of the top 3K Ask HN posts: [8]<p>The sentence embeddings of the titles of those posts: [9]<p>A colab notebook containing the code examples (including loading these two data files): [10]<p>============<p>If you've ever wanted to get into language models, this is a good place to start. Happy to answer any questions.
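The pipeline described above (embed titles, cluster them, inspect groups) can be sketched briefly. This is not the author's actual notebook: the real post embeds 3,000 Ask HN titles with Cohere's Embed endpoint, while here small random vectors stand in for the embeddings so the example runs anywhere with scikit-learn.

```python
# Sketch of the clustering step: given one embedding vector per post
# title, group the posts with k-means and inspect each cluster.
# Random vectors stand in for real title embeddings (assumption).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
titles = [f"Ask HN: question {i}" for i in range(300)]
embeddings = rng.normal(size=(300, 64))  # stand-in for real embeddings

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(embeddings)
for c in range(4):
    members = [t for t, lab in zip(titles, km.labels_) if lab == c]
    print(f"cluster {c}: {len(members)} posts, e.g. {members[0]}")
```

With real embeddings, nearby titles share a cluster because semantically similar sentences map to nearby vectors; with random vectors the groups are arbitrary, but the mechanics are the same.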
The conflict of interest here concerns me. I don't object to content marketing, but I'd rather a) you were clear from the start that you work for this company and are promoting its product, and b) you mentioned up front that this "revolves around [...] using Cohere’s Embed endpoint", so that people can judge how much they want to "get into language models" with pay-per-character pricing, as opposed to something more open.
There's too much fixation on "top" in our industry. Top-voted posts tend mostly to be a function of early posting; later posts don't get votes because they simply were not seen. There also seems to be a mass-scale misreading of what "top" really indicates: people think it means "quality" when it does not. Study after study, website after website, policy after policy, our online world is built on this fundamental misunderstanding of what is really going on. How do you avoid piling on to this misunderstanding?
Interesting, but it doesn't seem like the dimensionality reduction produces a good separation of topics. The UMAP projection looks pretty dense. Did you consider pruning or using something other than embeddings?
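One way to make the "good separation" question concrete is to measure how much cluster structure survives the projection to 2-D, e.g. with a silhouette score before and after reduction. A minimal sketch, using PCA from scikit-learn as a simple stand-in for UMAP (an assumption, so it runs without the umap-learn package) and synthetic blob "embeddings" with known topic labels:

```python
# Compare cluster separation in the original embedding space vs. a
# 2-D projection. PCA stands in for UMAP here (assumption); synthetic
# blobs stand in for real title embeddings.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# 400 fake "embeddings" in 64-D with 4 known topic clusters.
X, labels = make_blobs(n_samples=400, n_features=64, centers=4,
                       random_state=0)

X2 = PCA(n_components=2, random_state=0).fit_transform(X)
print("silhouette in 64-D:", round(silhouette_score(X, labels), 3))
print("silhouette in 2-D: ", round(silhouette_score(X2, labels), 3))
```

If the 2-D score drops sharply relative to the high-dimensional one, the projection is losing separation that the embeddings actually contain, which would support pruning or trying other reduction settings.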
Nice idea and analysis! I reproduced it as well with <a href="https://graphext.com" rel="nofollow">https://graphext.com</a> and got similar clusters <a href="https://drive.google.com/file/d/1-kXsKezu2_S07rQn-0bjbHuUXHEZUqg4/view?usp=sharing" rel="nofollow">https://drive.google.com/file/d/1-kXsKezu2_S07rQn-0bjbHuUXHE...</a>
I don't know how HN score metrics work, but after a short review of the data file [1] I noticed that many of the posts take the form of simple questions, which seems to bias user engagement naturally. Have you considered adding extra metrics to remove that bias and re-analyzing?<p>[1] <a href="https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_df.csv" rel="nofollow">https://storage.googleapis.com/cohere-assets/blog/text-clust...</a>
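The filtering step suggested above could look something like this: flag posts whose titles are short, single-question prompts and exclude them before re-running the analysis. The word-count heuristic and the column names below are assumptions, and a small inline frame stands in for the linked CSV.

```python
# Rough bias filter (assumption): treat short titles ending in "?" as
# "simple questions" and drop them before re-analysis.
import pandas as pd

df = pd.DataFrame({
    "title": [
        "Ask HN: Who is hiring?",
        "Ask HN: What are the best books on distributed systems?",
        "Ask HN: How do you stay motivated on long projects?",
        "Ask HN: Favorite podcast?",
    ],
    "score": [900, 450, 300, 700],
})

def is_simple_question(title: str, max_words: int = 6) -> bool:
    """Heuristic: a short title ending in '?' is a simple question."""
    body = title.removeprefix("Ask HN:").strip()
    return body.endswith("?") and len(body.split()) <= max_words

df["simple_question"] = df["title"].map(is_simple_question)
filtered = df[~df["simple_question"]]
print(filtered["title"].tolist())
```

The same `map`/boolean-mask pattern would apply unchanged to the full 3K-row CSV; only the heuristic (and perhaps a score normalization) would need tuning.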
As people upvote things other than your list of relevant links, it becomes difficult to find those links, although I guess people can find them via your name.<p>On edit: it seems some are upvoting the links to keep them at the top, in opposition to those upvoting discussion points.