My YOShInOn RSS reader works pretty well on HN comments. It ingests about 110 feeds, including <a href="https://hnrss.org/bestcomments" rel="nofollow">https://hnrss.org/bestcomments</a>; earlier I had tried <a href="https://hnrss.org/newcomments" rel="nofollow">https://hnrss.org/newcomments</a>, but the volume was overwhelming compared to the set of feeds I had at the time.<p>I treat recommendation as a classification problem: I run documents through an SBERT model and then do clustering, classification, and so on with tools from scikit-learn. The system currently trains on my last 120 days' worth of judgements and takes about 3 minutes to train, evaluate, and calibrate a model.<p>k-means clustering works great for lumping articles into big categories: sports articles wind up together, as do articles about computer programming, the Ukraine war, etc. These categories aren't labeled, but the system works by clustering the data and showing me the highest-scoring articles. I like the results a lot.<p>99% of the posts that I make to HN were selected first by the system and then twice by me.<p>You can ask ChatGPT to do topic classification; if you are lucky and suggestible you'll probably be impressed with the results at first, but when the honeymoon is over you'll see it isn't as accurate as you'd like. It's also slow and expensive.<p>I've thought about building a topic classifier with the same methods I use for recommendation; the main challenge is getting a training set. My take is that it takes 2,000-8,000 labeled examples to make a good classifier for one category, so if you want to support 20 categories you'll need 40,000-160,000 labeled documents. Labeling 1,000 documents a day takes about as much time and energy as a serious videogame habit. I have at times labeled 4,000 images a day, but I found it has effects on my visual system, including hallucinations. (e.g. 
go label photos of people and then ride the bus and you'll find yourself automatically classifying people by whether they have "short hair" or "medium hair" or whatever.)<p>There are some ways to cheat. <a href="https://tildes.net/" rel="nofollow">https://tildes.net/</a> has a pretty good classification system and I've been tempted to crawl the site; some newspapers also have good classification systems. (YOShInOn has avoided using these because I want it to learn to read text.) My k-means clusters correspond more or less to topics, so a little hand-editing of the cluster results would also be a fast way to build a training set.<p>Another question is what inputs to use: just the title, or more of the article? In the case of an "Ask HN" the title might be all you want. Titles are easy to pull out of the HN API, but crawling the actual articles would be a lot more work and mean collecting vastly more data. And there's a real limit to how well you can do with titles alone, because some titles are ambiguous.
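For the curious, the embed-then-cluster-and-classify approach described above can be sketched roughly like this with scikit-learn. Random vectors stand in for real SBERT embeddings (in practice you'd call SentenceTransformer(...).encode(texts) from the sentence-transformers package), and every name and size here is illustrative, not taken from YOShInOn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in embeddings: 200 documents x 384 dims (a MiniLM-sized vector);
# a real system would get these from an SBERT model, not random noise.
X = rng.normal(size=(200, 384))

# Stand-in thumbs-up/down judgements accumulated by the reader.
y = rng.integers(0, 2, size=200)

# Unsupervised step: lump documents into broad, unlabeled topic buckets.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Supervised step: score each document by predicted interest.
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# Surface the highest-scoring documents, with their cluster assignments.
top = np.argsort(scores)[::-1][:10]
for idx in top:
    print(f"doc {idx}: cluster {km.labels_[idx]}, score {scores[idx]:.2f}")
```

In a real setup the calibration step mentioned above could use something like scikit-learn's CalibratedClassifierCV so the scores behave like probabilities, but that's a separate refinement.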