Show HN: A naive classifier to figure out if a sentence contains dirty words

10 points by gauravm about 10 years ago

5 comments

kristiandupont about 10 years ago
A friend of mine told me about how they once had to dig through their code to figure out why their site was classified as adult by some filter. After days of searching, they found this comment at the bottom of a JavaScript file:

// Slut.

Which is Danish for "the end".
spoiler about 10 years ago
I don't find this very useful. It's *too* naïve for a real-world use case.

I didn't look at the implementation, but the "classy party" example looks like it simply matches a sequence of 'a', 's', and 's' bytes in a string.

It would be better if it tokenized the sentence using punctuation and white-space as terminators. So, it would detect `big-ass sandwich` and `smart-ass person` but not `classy party` or `bass instrument`.

Furthermore, it would be cool if you created a configuration format for this kind of thing, so one could do something like this (excuse the config format, I realise it's probably shit and problematic):

    [smart][big][fat]ass
    !sex[ual]+education

which would detect all of the following: smartass, bigass, fatass, *and* ass itself. The second rule would *not* filter a `sex(?:ual)` token followed by an `education` token. You get the idea.

These are just some heat-of-the-moment ideas, because I think this is exciting and could be useful. :-)
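
To make the tokenizing suggestion concrete, here is a minimal Python sketch. It is not the submitted project's code. The word list, the function names, and the regex reading of the `[smart][big][fat]ass` rule are all assumptions made up for illustration.

```python
import re

# Hypothetical banned list, for illustration only (not the project's real list).
BANNED_TOKENS = {"ass"}

# One possible reading of the "[smart][big][fat]ass" rule: an optional prefix
# followed by the banned word, matched against a whole token.
PREFIX_RULE = re.compile(r"(?:smart|big|fat)?ass")


def tokenize(sentence):
    """Lowercase the sentence and split it on whitespace and punctuation."""
    # [^\W_]+ matches runs of letters and digits, so "big-ass" -> ["big", "ass"].
    return re.findall(r"[^\W_]+", sentence.lower())


def contains_banned_substring(sentence, banned=BANNED_TOKENS):
    """Naive approach: flag any sentence containing a banned word as a substring."""
    lowered = sentence.lower()
    return any(word in lowered for word in banned)


def contains_banned_token(sentence, banned=BANNED_TOKENS):
    """Token approach: flag only when a whole token equals a banned word."""
    return any(token in banned for token in tokenize(sentence))


def matches_prefix_rule(sentence):
    """Flag tokens such as "ass", "smartass", "bigass", and "fatass"."""
    return any(PREFIX_RULE.fullmatch(token) for token in tokenize(sentence))


if __name__ == "__main__":
    for text in ["big-ass sandwich", "smart-ass person", "classy party", "bass instrument"]:
        print(text, contains_banned_substring(text), contains_banned_token(text))
```

With these definitions the substring check flags all four phrases, while the token check flags only `big-ass sandwich` and `smart-ass person`. The `!sex[ual]+education` exception is left out because it needs a look at adjacent tokens, and its exact semantics are not spelled out in the comment.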
nopuremore about 10 years ago
With the small effort of running your dirty words through Google Translate into Spanish (copy and paste all the words), you get a filter for Spanish. Add synonyms for stronger filtering.

Perhaps "gay" is not a dirty word? (It is included in your dirty words, but gay people would think otherwise.)
radio4fan about 10 years ago
I inherited a (dreadful) application which had a hilariously lame 'rude words filter'. It checked for words on a banned list.

The full list is here: http://pastebin.com/raw.php?i=1Pv4v8j7

It contains such gems as "cockburger", "penispuffer", and -- the piece de resistance -- "twatwaffleunclefucker".
natch about 10 years ago
What is the use case for such a classifier?