Hi HN,<p>I spent a few weeks looking at the top HN posts of all time. This included exploration, clustering, creating visualizations, and zooming in on what (to me personally) seem like some of the best discussions on here.<p>Three things in this post:<p>1- The interesting groups of HN posts<p>2- The interactive visualizations that you can explore in your browser<p>3- The data from this exploration -- this includes a CSV of the titles as well as the text embeddings of 3,000 Ask HN articles.<p>Blog post about this whole process here: [1]<p>============<p>1- The interesting groups of HN posts<p>From the exploration, Ask HN proved the most interesting. These are the top four groups of topics I found insightful. Each group contains about 400 posts.<p>- Life experiences and advice threads [2]<p>- Technical and personal development [3]<p>- Software career insights, advice, and discussions [4]<p>- General content recommendations (blogs/podcasts) [5]<p>============<p>2- The interactive visualizations that you can explore in your browser<p>- Top 10,000 Hacker News articles of all time [6]<p>- Top 3,000 posts in Ask HN [7]<p>============<p>3- The data from this exploration<p>CSV file of the top 3K Ask HN posts: [8]<p>The sentence embeddings of the titles of those posts: [9]<p>A colab notebook containing the code examples (including loading these two data files): [10]<p>============<p>If you've ever wanted to get into language models, this is a good place to start. Happy to answer any questions.
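The pipeline described above (embed titles, cluster them, inspect groups) can be sketched briefly. This is not the author's actual notebook: the real post embeds 3,000 Ask HN titles with Cohere's Embed endpoint, while here small random vectors stand in for the embeddings so the example runs anywhere with scikit-learn.

```python
# Sketch of the clustering step: given one embedding vector per post
# title, group the posts with k-means and inspect each cluster.
# Random vectors stand in for real title embeddings (assumption).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
titles = [f"Ask HN: question {i}" for i in range(300)]
embeddings = rng.normal(size=(300, 64))  # stand-in for real embeddings

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(embeddings)
for c in range(4):
    members = [t for t, lab in zip(titles, km.labels_) if lab == c]
    print(f"cluster {c}: {len(members)} posts, e.g. {members[0]}")
```

With real embeddings, nearby titles share a cluster because semantically similar sentences map to nearby vectors; with random vectors the groups are arbitrary, but the mechanics are the same.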
The conflict of interest here concerns me. I don't object to content marketing, but I'd rather a) you were clear from the start that you work for this company and are promoting its product, and b) you mentioned up front that this "revolves around [...] using Cohere’s Embed endpoint", so that people can judge how much they want to "get into language models" with pay-per-character pricing, as opposed to something more open.
There's too much fixation on "top" in our industry. Top-voted posts tend mostly to be a function of early posting; later posts don't get votes because they simply were not seen. There also seems to be a mass-scale misreading of what "top" really indicates: people think it means "quality" when it does not. Study after study, website after website, policy after policy, our online world is built on this fundamental misunderstanding of what is really going on. How do you avoid piling on to this misunderstanding?
Interesting, but it doesn't seem like the dimensionality reduction produces a good separation of topics. The UMAP projection looks pretty dense. Did you consider pruning or using something other than embeddings?
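One way to make the "good separation" question concrete is to measure how much cluster structure survives the projection to 2-D, e.g. with a silhouette score before and after reduction. A minimal sketch, using PCA from scikit-learn as a simple stand-in for UMAP (an assumption, so it runs without the umap-learn package) and synthetic blob "embeddings" with known topic labels:

```python
# Compare cluster separation in the original embedding space vs. a
# 2-D projection. PCA stands in for UMAP here (assumption); synthetic
# blobs stand in for real title embeddings.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# 400 fake "embeddings" in 64-D with 4 known topic clusters.
X, labels = make_blobs(n_samples=400, n_features=64, centers=4,
                       random_state=0)

X2 = PCA(n_components=2, random_state=0).fit_transform(X)
print("silhouette in 64-D:", round(silhouette_score(X, labels), 3))
print("silhouette in 2-D: ", round(silhouette_score(X2, labels), 3))
```

If the 2-D score drops sharply relative to the high-dimensional one, the projection is losing separation that the embeddings actually contain, which would support pruning or trying other reduction settings.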
Nice idea and analysis! I reproduced it as well with <a href="https://graphext.com" rel="nofollow">https://graphext.com</a> and got similar clusters <a href="https://drive.google.com/file/d/1-kXsKezu2_S07rQn-0bjbHuUXHEZUqg4/view?usp=sharing" rel="nofollow">https://drive.google.com/file/d/1-kXsKezu2_S07rQn-0bjbHuUXHE...</a>
I don't know how HN score metrics work, but after a short review of the data file [1] I noticed that many of the posts take the form of simple questions, which seems to bias user engagement naturally. Have you considered adding extra metrics to remove that bias and re-analyzing?<p>[1] <a href="https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_df.csv" rel="nofollow">https://storage.googleapis.com/cohere-assets/blog/text-clust...</a>
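The filtering step suggested above could look something like this: flag posts whose titles are short, single-question prompts and exclude them before re-running the analysis. The word-count heuristic and the column names below are assumptions, and a small inline frame stands in for the linked CSV.

```python
# Rough bias filter (assumption): treat short titles ending in "?" as
# "simple questions" and drop them before re-analysis.
import pandas as pd

df = pd.DataFrame({
    "title": [
        "Ask HN: Who is hiring?",
        "Ask HN: What are the best books on distributed systems?",
        "Ask HN: How do you stay motivated on long projects?",
        "Ask HN: Favorite podcast?",
    ],
    "score": [900, 450, 300, 700],
})

def is_simple_question(title: str, max_words: int = 6) -> bool:
    """Heuristic: a short title ending in '?' is a simple question."""
    body = title.removeprefix("Ask HN:").strip()
    return body.endswith("?") and len(body.split()) <= max_words

df["simple_question"] = df["title"].map(is_simple_question)
filtered = df[~df["simple_question"]]
print(filtered["title"].tolist())
```

The same `map`/boolean-mask pattern would apply unchanged to the full 3K-row CSV; only the heuristic (and perhaps a score normalization) would need tuning.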
As people upvote things other than your list of relevant links, it becomes difficult to find those links, although I guess people can find them via your name.<p>On edit: it seems some are upvoting the links to keep them at the top, in opposition to those upvoting discussion points.