Algorithmic tagging of Hacker News or any other site

105 points by doppenhe, about 11 years ago

22 comments

Goosey, about 11 years ago

Looking at the HN demo, I'm impressed. There are definitely relevant tags being generated. Unfortunately there are also some noisy tags which clutter the results. Taking one example, the post "DevOps? Join us in the fight against the Big Telcos" is given the tags "phone tools sendhub we're news experience customers comfortable"; I would say that "we're" is inarguably noise. Another example: "Questions for Donald Knuth" with the tags "computer programming don i've knuth taocp algorithms i'm", where I would call out "i've" and "i'm".

There are other words in both examples that I personally would not use as tags, but I can't really say they would be universally not useful. I think a vast improvement could be made just by having a dictionary blacklist filled with things like these. From this tiny sampling, contractions seem to be a big loser.
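
The blacklist idea above could be sketched roughly as follows. This is only an illustration, not how the demo works: the `BLACKLIST` contents, the `filter_tags` name, and the rule "any tag containing an apostrophe is a contraction" are all assumptions made for the example.

```python
import re

# Hypothetical blacklist holding only the words flagged in the comment above.
# A real list would be much longer.
BLACKLIST = {"we're", "i've", "i'm"}

# Assumption: any tag containing an apostrophe (straight or curly) is a contraction.
CONTRACTION = re.compile(r"['\u2019]")


def filter_tags(tags, blacklist=BLACKLIST):
    """Return tags with blacklisted words and contractions removed, order kept."""
    kept = []
    for tag in tags:
        word = tag.lower()
        if word in blacklist or CONTRACTION.search(word):
            continue
        kept.append(tag)
    return kept


knuth_tags = "computer programming don i've knuth taocp algorithms i'm".split()
print(filter_tags(knuth_tags))
```

On the Knuth example this keeps "computer programming don knuth taocp algorithms". Note that "don" survives: a fragment like that has no apostrophe left, so it would need an explicit blacklist entry.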

vhf, about 11 years ago

Very interesting.

I have been doing some research on automatic tagging lately, and I found several Python projects that come close to this goal: https://pypi.python.org/pypi/topia.termextract/ , https://github.com/aneesha/RAKE , https://github.com/ednapiranha/auto-tagify but none of them is satisfying, whereas Algorithmic Tagging of HN looks pretty good.

I have been trying to implement a similar feature for http://reSRC.io, to automagically tag articles for easy retrieval through the tag search engine.

sytelus, about 11 years ago

Well, it's not that easy. The algorithms are very primitive and too full of noise to be useful.

For example, try this on restaurant reviews like http://www.yelp.com/biz/el-gaucho-seattle. I get these tags:

steak reviews seattle food service gaucho restaurant review

Not useful, right?

The current state of the art would use much more sophisticated NLP for generating POS tags and use sentiment analysis. For example, check out MSR Splat at http://research.microsoft.com/en-us/projects/msrsplat/default.aspx.

Theodores, about 11 years ago

This does well on the 'T-shirt test' on some sites, e.g. http://www.riverisland.com/men/t-shirts--vests

This could be really useful in e-commerce for creating search keywords for category pages. The noise in the results matters not: so long as it gets 'T-shirt' and someone searches for 'T-shirt', all is well and good.

Are you looking to plug what you have into something such as the Magento e-commerce platform? The right clients could pay proper money for this functionality. It is something I would quite like to speak to you about.

EGreg, about 11 years ago

LDA is very impressive. But it might be better to have an iterative algorithm that forms a linear-algebraic basis from several tags (and lets people add more tags as vectors into the mix). Then every time people upvote something, you update their interests (points in the linear-algebraic space), and every time an article gets upvoted you update ITS tags...

After a while the system converges to a very useful structure: new members can see correctly tagged articles, and the system learns their interests by itself.

Do you know of anything like this already existing?
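
One toy reading of the scheme described above, as a sketch under stated assumptions: each tag is one axis, users and articles are vectors in that space, and an upvote nudges each vector toward the other. The tag names, the `RATE` step size, and the `nudge`/`upvote`/`top_tags` helpers are all hypothetical, and whether such an update converges to anything useful on real data is exactly the open question.

```python
TAGS = ["python", "erlang", "startups"]  # hypothetical basis, one axis per tag
RATE = 0.1                               # hypothetical step size


def nudge(vec, target, rate=RATE):
    """Move vec a fraction `rate` of the way toward target."""
    return [v + rate * (t - v) for v, t in zip(vec, target)]


def upvote(user, article, rate=RATE):
    """Return updated (user, article) vectors after the user upvotes the article."""
    return nudge(user, article, rate), nudge(article, user, rate)


def top_tags(vec, k=2):
    """Names of the k largest components of vec."""
    ranked = sorted(zip(vec, TAGS), reverse=True)
    return [name for _, name in ranked[:k]]


user = [0.0, 0.0, 0.0]     # new member, no known interests
article = [0.9, 0.1, 0.0]  # article seeded mostly with "python"
for _ in range(20):        # repeated interaction, just to show the drift
    user, article = upvote(user, article)
print(top_tags(user))      # the user's strongest inferred interests
```

A symmetric update like this also pulls the article's vector toward users with empty profiles, so a real system would probably use a smaller rate on the article side.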

dlsym, about 11 years ago

This has real poetic potential:
<pre><code>
"Erlang and code style"
process erlang undefined file write data true code
</code></pre>

zokier, about 11 years ago

After watching "Enough Machine Learning to Make Hacker News Readable Again" [1], I thought of a link-sharing and discussion system (e.g. HN/reddit style) based on a recommendation engine or machine learning. Your front page would be continuously shaped by your up/down-votes. I'm not sure if the same could be applied to comment threads too, essentially creating automatic moderation. Algorithmic tagging would certainly be useful for that kind of site.

[1] https://news.ycombinator.com/item?id=7712297

NKCSS, about 11 years ago

Not too impressed, to be honest; singular and plural forms are not treated as equal. I'm not familiar with LDA, but I've written an LSA implementation in the past, and it did a lot better than what is shown here.
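
The singular/plural complaint above is usually handled by normalizing tokens before counting them. A minimal sketch, assuming English text and a deliberately crude suffix rule; `singularize` and `merge_counts` are hypothetical helpers, and a real pipeline would more likely use a proper stemmer or lemmatizer:

```python
from collections import Counter


def singularize(word):
    """Very rough English plural stripping; not a real lemmatizer."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"  # "queries" -> "query"
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]        # "tags" -> "tag"
    return word


def merge_counts(tokens):
    """Count tokens after lowercasing and folding plurals into singulars."""
    return Counter(singularize(t.lower()) for t in tokens)


tokens = ["review", "reviews", "Reviews", "steak", "steaks", "glass"]
print(merge_counts(tokens).most_common(2))
```

With this folding, "review" and "reviews" count as one term instead of competing for two tag slots.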

NicoJuicy, about 11 years ago

Lol, this seriously took me by surprise. I'm currently developing a Hacker News with tags (you can self-host it). I quickly put together this Google Form, if you are interested in being a beta user in the near future: https://docs.google.com/forms/d/1UeSD11hrjwhsVbbPiv63VZBrEczzG5Tr4lwkuKAzY8A/viewform?usp=send_form

PS. Screenshot included, and it's already in alpha in a company with 100 users.

snippyhollow, about 11 years ago

I did that in 2012 for a pet project with a friend: https://github.com/SnippyHolloW/HN_stats

Here is the trained topic model (Nov. 30, 2012) with only 40 topics (mainly for file size): https://dl.dropboxusercontent.com/u/14035465/hn40_lemmatized.ldamodel

You can load it with Python:
<pre><code>
from gensim.models import ldamodel

# Load the saved 40-topic model (the file linked above, in the working directory)
lda = ldamodel.LdaModel.load("hn40_lemmatized.ldamodel")
# Expand alpha to one entry per topic
lda.alpha = [lda.alpha for _ in range(40)]  # because there was a change since 2012
lda.show_topics()
</code></pre>

Now if you can figure out what this file is: https://dl.dropboxusercontent.com/u/14035465/pg40.params I'll buy you a beer next time you're in Paris or I'm in the Valley. ;-)

andrew_gardener, about 11 years ago

After this thread received 42 comments, I ran their tagging algorithm on this page and got:

tags tagging hours link doppenhe reply ago lda

looks pretty promising!

gibrown, about 11 years ago

LDA/topic modeling is interesting stuff. I always feel like the way this data gets surfaced as "tags" is very ineffective. Any non-tech person would look at this and generally be confused. So this item is triggering my rants against tagging:

- Tagging is like trying to predict the future. What word will help some future person get to this content?
- Tagging often tries to fill the hole left by bad search.
- There is no evaluation method to measure how good a set of tags is.
- Tags make very bad UI clutter.

Some of these points are related to encouraging users to tag content, but auto-tagging also seems problematic.

To me, something more along the lines of entity extraction is more useful, because it is a well-defined problem and can be used to improve a lot of other applications.

NicoJuicy, about 11 years ago

I like this project (I am creating something like this, so I'm pretty serious).

But doesn't the auto-tagging feature make too much noise for a business use case? For example, it tags an article about Amazon and includes Google in the tags. Whitelisting words wouldn't fix this (Google is a whitelisted word if Amazon is).

I don't know about LDA though. Perhaps proper tag administration would fix this, but then you'd have to remove tags on the go.

platypii, about 11 years ago

Direct link to HN with tags: http://algorithmia.com/demo/hn

nopal, about 11 years ago

Has anyone seen Open Calais [1]? It does tagging and categorization. It's been around for years and seems pretty powerful. It's a bit lower-level than Algorithmia (not href-aware), but it seems more powerful, and a system like Algorithmia could be built on it.

[1] http://www.opencalais.com/about

doczoidberg, about 11 years ago

With German sites it does not work so well. Is there no blacklist of too-generic terms for languages other than English?

shawabawa3, about 11 years ago

Doesn't seem to handle PDFs properly. For the mtgox link it comes up with:

> stream rotate type/page font structparents endobj obj endstream

pjbrunet, about 11 years ago

I signed up. Not sure if I would use it, but the Algorithmia concept is pretty interesting.

draz, about 11 years ago

@doppenhe - any hunch as to how well it would work on transcripts?

vincentbarr, about 11 years ago

error: failed to find worker for algorithm

hnriot, about 11 years ago

Maybe also take a look at AlchemyAPI

justplay, about 11 years ago

looks cool.
