Show HN: Subreddit Finder - Trained on 4M Reddit Posts from 4K Subreddits

152 points by Arimbr about 5 years ago

25 comments

H8crilA about 5 years ago
"What is the penalty for living" -> http://reddit.com/r/Poland, 28%
"When should I kill my chicken" -> http://reddit.com/r/csgo, 19%
"Am I conscious" -> http://reddit.com/r/INTP, 25%
"How to not think" -> http://www.reddit.com/r/howtonotgiveafuck/, 49%
"Is the government evil" -> http://www.reddit.com/r/ENLIGHTENEDCENTRISM/, 19%
"Is the government good" -> http://www.reddit.com/r/CoronavirusUK, 10%
"Is the government useful" -> http://www.reddit.com/r/iran, 31%
localcrisis about 5 years ago
I tried "find hot local singles in your area" and the top result was /r/vinyls

actually very impressed
jrumbut about 5 years ago
A very cool demo and I congratulate the author, but I am always a little sad for more data science type demos that try to answer the question (that is proving toxic) "given what I know about you, how can I find a community of people just like you?"

I would love to see a subreddit finder that answers questions like "what community would complement your interests?" or "what community needs to hear what you have to say?" or "what community would be made better by your presence?". Similarity is at best a proxy for it.

Those are harder but, I think, more useful.
Nextgrid about 5 years ago
I tried it with "best time tracking app for iOS?" and "I'm looking for a time tracking app. Any recommendations?"

I expected the iPhone or iOS subreddit to be suggested, but it suggested GearVR | 13.0%, ringdoorbell | 9.0%, canadacordcutters | 5.0%, TTVreborn | 5.0%, AusSkincare | 4.0%, sideloaded | 4.0%, FlutterDev | 2.0%, shopify | 2.0%, weightwatchers | 2.0%, crossfit | 2.0%.

Congrats on the attempt but it does still need some work.
gnicholas about 5 years ago
The intercom chat widget makes the tab title switch back and forth between "Subreddit Finder" and "Valohai says". There does not appear to be a way to dismiss the chat widget, so it just keeps flipping back and forth, which is visually annoying.

I keep many tabs open, but I am going to close this one immediately because I don't want to have something flashing at me out of the corner of my eye all day.
Der_Einzige about 5 years ago
One place to improve this would be to use a better set of word embeddings. FastText is, well, fast, but it's no longer close to SOTA.

You're most likely using simple average pooling, which is why many users are getting results that don't look right to them. Try a chunking approach, where you get a vector for each chunk of the document and horizontally concatenate those together (if your vectors are 50d and you do 5 chunks per doc, then you get a 250d fixed vector for each document regardless of length). This partially solves the issue of highly diluted vectors, which is responsible for the poor results that some users are reporting. You can also do "attentive pooling", where you pool the way a transformer head would pool, though that's an O(N^2) operation, so YMMV.

If you have the GPU compute, try something like BERT, or GPT-2 which is fine-tuned on all of reddit. Better yet, try vertically concatenating all of the word-embedding models you can together (just stack the embeddings from each model) if you have the compute.

To respond to your comment (since HN isn't letting me post cus I'm 'posting too fast'):

You can use cheaper and more effective approaches for getting the subword functionality you want.

Look up "Byte Pair Embeddings". That will also handle the OOV problem but for far less CPU/RAM overhead. BERT also does this for you with its unique form of tokenization.

A home CPU can fine-tune FastText in a day on 4 million documents if you're able to walk away from your computer for a while. Shouldn't cost you anything except electricity. If you set the number of epochs higher, you'll get better performance but correspondingly longer times to train.

For BERT/GPT-2, you'll maybe want to fine-tune a small version of the model (say, the 117m parameter version of GPT-2) and then vertically concatenate that with the regular un-fine-tuned GPT-2 model. That should be very fast and hopefully not expensive (and also possible on your home GPU).
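For readers who want to see the chunking suggestion concretely, here is a minimal sketch in plain Python. It is not the Subreddit Finder's actual code. The names `mean_pool` and `chunked_pool`, the toy embedding table, and the choice to split the document into equal-sized runs of in-vocabulary tokens are all assumptions made for illustration; a real pipeline would look vectors up in a trained model instead of a dict.

```python
from typing import Dict, List

Vector = List[float]


def mean_pool(vectors: List[Vector], dim: int) -> Vector:
    """Average a list of vectors; an empty chunk becomes a zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def chunked_pool(tokens: List[str], embeddings: Dict[str, Vector],
                 dim: int, n_chunks: int = 5) -> Vector:
    """Split the document into n_chunks runs of tokens, average each run,
    and concatenate the averages into one vector of length dim * n_chunks."""
    # Hypothetical lookup: out-of-vocabulary tokens are simply skipped.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    pooled: Vector = []
    for c in range(n_chunks):
        start = c * len(vecs) // n_chunks
        end = (c + 1) * len(vecs) // n_chunks
        pooled.extend(mean_pool(vecs[start:end], dim))
    return pooled


# Toy 2-d embeddings, made up for this example.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [3.0, 1.0]}
doc_vector = chunked_pool(["a", "b", "c", "d"], toy, dim=2, n_chunks=2)
```

With 50-dimensional vectors and 5 chunks this yields the 250-dimensional fixed-length vector mentioned in the comment, whatever the document length. Plain average pooling is the special case `n_chunks=1`.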
BatFastard about 5 years ago
Would be nice to have the subreddits be links. So I could just click it to open a new tab of that subreddit.
weaponizedwords about 5 years ago
Tried it with Hearthstone related content.
Title: turn 2 lethal
Content: I managed to cheat out 4 prophet valens on turn 2 followed up by mind blast.

Results: shadowverse, elderscrollslegends, teamfighttactics, teemotalk, fioramains, ekkomains, ezrealmains, bobstavern, kaisamains, xcom2

Should include: hearthstone
It did pick up BobsTavern, which is something. I thought you would want some feedback.
exegete about 5 years ago
Cool. Last year I created something like this as a Chrome extension so that you could type in your post and it would show up on reddit where to post. You could then just select it by clicking a link. Project is here: https://github.com/wesbarnett/insight
jnwatson about 5 years ago
Searching for "marijuana" should point to trees (internal Reddit joke) and not marijuana primarily.
applecrazy about 5 years ago
This is pretty good. Typed in "$spy 1000" and it said r/wallstreetbets (100%). Accurate.
ramraj07 about 5 years ago
Tried stocks, stock options, investing - all kept giving Robinhoodpennystocks as the top option. Not sure if the model is not fully trained?

What are some examples where the model does recommend meaningful things?
duxup about 5 years ago
Suggested subreddits for this post:

lostredditors 45%

Well yes that is likely, but maybe not a good suggestion as that is a place where folks point out people who posted the wrong thing in the wrong sub or conversation ;)
SeekingMeaning about 5 years ago
“I got straight A’s this semester!!!”

aggies 19.0%
bluetwo about 5 years ago
Clicking a sub-reddit name should open the sub-reddit in a new window/tab.
SkyPuncher about 5 years ago
This is awesome!

I often find that when I'm buying something new, I want to find subs related to that product category.

While this doesn't find me direct results, it shows me communities that I should focus my research on.
s_dev about 5 years ago
Tried to find /r/DevelEire using the search terms "Irish Software Developers"

No luck but Google will bring it up as a first result if the query is "Irish Software Developers Reddit".
jokoon about 5 years ago
I wish reddit would allow me to download all my comments.

Apparently it's not possible since they're all archived, because reddit constantly regenerates its webpages.
_v7gu about 5 years ago
Is r/thedonald not included in the database? I'm trying the usual suspect titles, but get nothing.
minimaxir about 5 years ago
You should probably add how exactly you retrieved the 4 million Reddit posts.
dominotw about 5 years ago
hey i tried

title: My siberian cat
Message: My floof

I was hoping to find r/SiberianCats where i usually post but it wasn't in the list.

I googled "siberian cat subreddit" and r/SiberianCats was the first link.
pmoriarty about 5 years ago
Or you could just ask here:

https://old.reddit.com/r/findareddit/
dehrmann about 5 years ago
Anyone remember /r/reddit.com? That was around the time reddit looked like Hacker News and people were embarrassed to admit they use it.
bobberkarl about 5 years ago
I just ran some pretty dumb queries. I can *assure* you something is missing.
dzonga about 5 years ago
good attempt. but needs a lot of work.