Tagger News takes a subset of HN articles, analyzes using ML, and applies tags

308 points by var_explained about 8 years ago

18 comments

minimaxir about 8 years ago

Direct URL to project details: <a href="https://devpost.com/software/tagger-news" rel="nofollow">https://devpost.com/software/tagger-news</a>

A few comments:

1. To other commenters: as with the HN Vue demo a week ago (<a href="https://news.ycombinator.com/item?id=14284877" rel="nofollow">https://news.ycombinator.com/item?id=14284877</a>), the project is a technical proof-of-concept; the aesthetics aren't the primary focus.
2. The Algolia API is better for scraping because it allows for bulk requests, unlike the official API (my old 2014 script still works, I think: <a href="https://github.com/minimaxir/get-all-hacker-news-submissions-comments" rel="nofollow">https://github.com/minimaxir/get-all-hacker-news-submissions...</a>).
3. How much time did it take to manually label the training/test set before training the RF classifier? Even with topic modeling for extrapolating tags, accurate labeling for 20,000 submissions is no small task.
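To illustrate the bulk-request point about the Algolia API, here is a minimal Python sketch that only builds the request URLs for walking backwards through stories. It is not the linked 2014 script. The `search_by_date` endpoint and the `tags`, `hitsPerPage`, and `numericFilters` parameters belong to the public HN Search API; the function names and the page size of 1000 are assumptions made for illustration.

```python
from urllib.parse import urlencode

ALGOLIA_SEARCH_BY_DATE = "https://hn.algolia.com/api/v1/search_by_date"


def build_story_page_url(before_ts=None, hits_per_page=1000):
    """Build a URL requesting one page of stories, newest first.

    before_ts is a Unix timestamp. When given, only stories created
    strictly before it are requested, so a caller can page backwards.
    hits_per_page=1000 is an assumed bulk page size, not a documented limit.
    """
    params = {"tags": "story", "hitsPerPage": hits_per_page}
    if before_ts is not None:
        params["numericFilters"] = f"created_at_i<{before_ts}"
    return f"{ALGOLIA_SEARCH_BY_DATE}?{urlencode(params)}"


def next_cursor(hits):
    """Return the oldest created_at_i in a page of hits, or None if empty.

    Caveat: paging strictly before this value can skip stories that share
    the same second as the oldest hit on the page.
    """
    if not hits:
        return None
    return min(hit["created_at_i"] for hit in hits)
```

A caller would fetch the first URL, pass the returned `hits` list to `next_cursor`, and repeat with the new `before_ts` until a page comes back empty.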

robertelder about 8 years ago

I think the biggest value proposition of this is the ability to do subreddit-like filtering on specific tags. As Hacker News grows, I think dealing with the number of new submissions will become a bottleneck. During high-traffic times, new submissions sometimes drop off the first 'new' page in around 10 minutes. Of course there is more traffic during these times to upvote good content, but I'm not sure that is better than letting a smaller number of people have a longer period of time to filter a smaller collection of content.

nickpsecurity about 8 years ago

The strength of Hacker News is the network effect of a diverse, intelligent crowd. It would be hard to replace. A supplemental site that tags it and aids search has value. The biggest problem I have searching HN, though, is Google mixing up stories and comments. The fix might be as simple as two domains that contain stories and comments cloned from HN, one domain for each, followed by Google searches within those domains. Not sure if Google would automatically crawl them, though.

asymmetric about 8 years ago

Congratulations to the team, although it seems the algorithm isn't very accurate, since this article[0] from 1997 was tagged with "Blockchain".

[0]: <a href="https://www.gnu.org/philosophy/right-to-read.html?source=techstories.org" rel="nofollow">https://www.gnu.org/philosophy/right-to-read.html?source=tec...</a>

hntop about 8 years ago

I did a similar project a few months ago. It does automatic tagging and summarization of HN, largely using scipy and numpy. You can see it in action at <a href="http://hntop.org" rel="nofollow">http://hntop.org</a>, and here is the GitHub link: <a href="https://github.com/bexp/textai" rel="nofollow">https://github.com/bexp/textai</a>

salmonfamine about 8 years ago

Not to hijack, but this is similar to a small ML project a friend and I built. It takes news headlines from a bunch of sources and classifies them by common topic. We took a lot longer than a day to build it, though. ;)

It refreshes with new stories every few hours. You can check it out here: <a href="http://headlinr.herokuapp.com/" rel="nofollow">http://headlinr.herokuapp.com/</a>

EDIT: click on the bubbles to see individual headlines. Also, here's the GitHub page: <a href="https://github.com/dgarrick/headliner" rel="nofollow">https://github.com/dgarrick/headliner</a>

rileymat2 about 8 years ago

Personally, I find that the visual weight of the tags exceeds their value to me.

shawkinaw about 8 years ago

I think this is great, especially being able to click a tag and see a top 30 list of that tag.

Obvious suggestions that would make it usable as a primary HN interface:

• Login and voting (not sure the HN API supports this though)
• Tag suggestions to feed into the model

big_spammer about 8 years ago

Link to try it out: <a href="http://www.taggernews.com/" rel="nofollow">http://www.taggernews.com/</a>

sasoon about 8 years ago

Here is my take on it from a few years ago. I tried to make it more like a magazine, pulling in the article text and photo, and there is a section with only the articles that reached the top position. <a href="http://www.hnzine.com" rel="nofollow">http://www.hnzine.com</a>

egypturnash about 8 years ago

Possibly off-topic, but: this was done at the "Disrupt NYC" hackathon, and somehow "add tags to HN" feels like about the least disruptive thing ever.

pera about 8 years ago

Wow, I was thinking of making something similar: an experimental HN fork where submissions are tagged (collaboratively) but shown without titles, as these are rarely useful for predicting the content of an article. And of course there is also the convenience of categorization.

RichardHeart about 8 years ago

This article on book publishing: <a href="https://news.ycombinator.com/item?id=14334845" rel="nofollow">https://news.ycombinator.com/item?id=14334845</a> is tagged with "blockchain" only. Any idea why?

the_arun about 8 years ago

Tried <a href="http://www.taggernews.com/tags/aws/" rel="nofollow">http://www.taggernews.com/tags/aws/</a> and didn't find any results. Is it because it is listing only a few tags for now?

Imagenuity about 8 years ago

This is what @twitter needs to do to make discovering worthwhile information better.

Hashtag spam makes #hashtags mostly useless as a method of discovery.

faragon about 8 years ago

Using ML? Why not just use Bayesian filters?
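For readers unfamiliar with the term, a Bayesian filter here would most likely mean something like a naive Bayes text classifier, which is itself a simple form of ML. The sketch below is a toy multinomial naive Bayes tagger over titles with add-one smoothing. It is not what Tagger News does (an earlier comment mentions an RF classifier), and the function names and training examples are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count tags and per-tag words from (title, tag) pairs."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for title, tag in examples:
        tag_counts[tag] += 1
        for word in title.lower().split():
            word_counts[tag][word] += 1
            vocab.add(word)
    return tag_counts, word_counts, vocab


def predict(model, title):
    """Return the tag with the highest log-posterior for a title."""
    tag_counts, word_counts, vocab = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag, count in tag_counts.items():
        score = math.log(count / total)  # log prior for this tag
        denom = sum(word_counts[tag].values()) + len(vocab)
        for word in title.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothed log likelihood
                score += math.log((word_counts[tag][word] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical toy data, only to show the call pattern.
toy_examples = [
    ("bitcoin ledger mining", "blockchain"),
    ("ethereum smart contract ledger", "blockchain"),
    ("aws lambda pricing", "cloud"),
    ("aws s3 outage", "cloud"),
]
toy_model = train(toy_examples)
```

Calling `predict(toy_model, "new aws region")` picks the tag whose training titles best explain the known words; on real HN data the vocabulary, tokenization, and tag set would all need far more care.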

magicmikexxl about 8 years ago

Can you also add a way to add TLDRs to everything, pls? :D

rocky1138 about 8 years ago

Stop trying to remake Hacker News. It's pretty much perfect the way it is.
