Direct URL to project details: <a href="https://devpost.com/software/tagger-news" rel="nofollow">https://devpost.com/software/tagger-news</a><p>A few comments:<p>1. To other commenters: as with the HN Vue demo a week ago (<a href="https://news.ycombinator.com/item?id=14284877" rel="nofollow">https://news.ycombinator.com/item?id=14284877</a>), the project is a technical proof-of-concept; the aesthetics aren't the primary focus.<p>2. The Algolia API is better for scraping because it allows for bulk requests, unlike the official API (my old 2014 script still works, I think: <a href="https://github.com/minimaxir/get-all-hacker-news-submissions-comments" rel="nofollow">https://github.com/minimaxir/get-all-hacker-news-submissions...</a>)<p>3. How much time did it take to manually label the training/test set before training the RF classifier? Even with topic modeling for <i>extrapolating</i> tags, accurately labeling 20,000 submissions is a sizable task.
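On point 2 above, here is a rough sketch of what bulk paging through the Algolia HN Search API can look like. It is not taken from the linked script. The function names are made up for illustration, and walking backwards in time with a `created_at_i` filter is just one assumed paging strategy.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Algolia HN Search endpoint that returns items sorted newest first.
BASE_URL = "https://hn.algolia.com/api/v1/search_by_date"


def build_page_url(before_ts, hits_per_page=1000):
    """Build the URL for one page of stories created strictly before
    `before_ts` (Unix seconds). Helper name is hypothetical."""
    params = {
        "tags": "story",
        "hitsPerPage": hits_per_page,
        "numericFilters": f"created_at_i<{before_ts}",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def next_cursor(hits):
    """Return the oldest timestamp on a page, to be used as the next
    `before_ts`. Returns None when the page is empty."""
    return min((h["created_at_i"] for h in hits), default=None)


def fetch_page(before_ts):
    """Fetch one page of story records. Makes a network call, so it is
    defined here but not invoked."""
    with urlopen(build_page_url(before_ts)) as resp:
        return json.load(resp)["hits"]
```

To collect many submissions you would call `fetch_page` in a loop, feeding `next_cursor(hits)` back in until it returns `None`. One caveat with this assumed strategy: the strict `<` filter can skip stories that share the same second as the last item on a page, so a real scraper should probably deduplicate on `objectID` and overlap the boundary.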
I think the biggest value proposition of this is the ability to do subreddit-like filtering on specific tags. As Hacker News grows, I think dealing with the number of new submissions will become a bottleneck. During high-traffic times, new submissions sometimes drop off the first 'new' page in around 10 minutes. Of course there is more traffic during these times to upvote good content, but I'm not sure that is better than letting a smaller number of people have a longer period of time to filter a smaller collection of content.
The strength of Hacker News is the network effect of a diverse, intelligent crowd. It would be hard to replace. A supplemental site that tags it and aids search has value. The biggest problem I have searching HN, though, is Google mixing up stories and comments. The fix might be as simple as two domains containing stories and comments cloned from HN, one domain for each, followed by Google searches within those domains. Not sure if Google would automatically crawl them, though.
Congratulations to the team, although it seems the algorithm isn't very accurate, since this article[0] from 1997 was tagged with "Blockchain".<p>[0]: <a href="https://www.gnu.org/philosophy/right-to-read.html?source=techstories.org" rel="nofollow">https://www.gnu.org/philosophy/right-to-read.html?source=tec...</a>
I did a similar project a few months ago. It does automatic tagging and summarization of HN, largely using scipy and numpy. You can see it in action: <a href="http://hntop.org" rel="nofollow">http://hntop.org</a>
Here is the GitHub link: <a href="https://github.com/bexp/textai" rel="nofollow">https://github.com/bexp/textai</a>
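For readers wondering what automatic tagging can look like in its simplest form, below is a minimal TF-IDF keyword sketch. It is not the method used by the project linked above (I have not checked how that one works). It uses only the Python standard library rather than scipy/numpy, and the function names and sample titles are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_tags(docs, top_k=2):
    """Return up to `top_k` candidate tags per document, ranked by TF-IDF.

    Words that appear in every document get a score of zero and are
    never returned as tags. Ties are broken alphabetically.
    """
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    n_docs = len(docs)
    tags = []
    for toks in tokenized:
        counts = Counter(toks)
        scores = {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        tags.append([w for w in ranked[:top_k] if scores[w] > 0])
    return tags


# Hypothetical submission titles, not real HN data.
titles = ["The Bitcoin blockchain", "The AWS cloud", "The Rust compiler"]
tags = tfidf_tags(titles)
```

With these three titles, "the" appears everywhere and is dropped, so each title is tagged with its two distinctive words. A real tagger would need stop-word handling, stemming, and a fixed tag vocabulary on top of this.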
Not to hijack, but this is similar to a small ML project a friend and I built. It takes news headlines from a bunch of sources and classifies them by common topic. We took a lot longer than a day to build it, though. ;)<p>It refreshes with new stories every few hours. You can check it out here: <a href="http://headlinr.herokuapp.com/" rel="nofollow">http://headlinr.herokuapp.com/</a><p>EDIT: click on the bubbles to see individual headlines. Also, here's the GitHub page: <a href="https://github.com/dgarrick/headliner" rel="nofollow">https://github.com/dgarrick/headliner</a>
I think this is great, especially being able to click a tag and see a top 30 list of that tag.<p>Obvious suggestions that would make it usable as a primary HN interface:<p>• Login and voting (not sure the HN API supports this though)<p>• Tag suggestions to feed into the model
Here is my take on it from a few years ago. I tried to make it more like a magazine, pulling in the article text and a photo, and there is a section with only the articles that reached the top position.
<a href="http://www.hnzine.com" rel="nofollow">http://www.hnzine.com</a>
Possibly off-topic, but: this was done at the "Disrupt NYC" hackathon, and somehow "add tags to HN" feels like about the least disruptive thing ever.
Wow, I was thinking of making something similar: an experimental HN fork where submissions are tagged (collaboratively) <i>but</i> without titles, as these are rarely useful for predicting the content of an article. And of course there is also the convenience of categorization.
This article on book publishing: <a href="https://news.ycombinator.com/item?id=14334845" rel="nofollow">https://news.ycombinator.com/item?id=14334845</a> is tagged with "blockchain" only. Any idea why?
Tried <a href="http://www.taggernews.com/tags/aws/" rel="nofollow">http://www.taggernews.com/tags/aws/</a> and didn't find any results. Is that because it only lists a few tags for now?
This is what @twitter needs to do to make discovering worthwhile information better.<p>Hashtag spam makes #hashtags mostly useless as a method of discovery.