Show HN: Out-of-the-box text classification models

20 points by brianjkim21, about 1 year ago

6 comments

ramses0, about 1 year ago
You're doing yourself a disservice by presenting your "Class/Subclass/Sub-Subclass" outcome as a fading, disappearing text GIF.

It takes ~30 seconds for your animation to loop, and the outcome is visible for only ~8 seconds of that.

Once you've started to see/parse the outcome, your (my) instinct is to go back to the initial (small, tough to read) "Support Ticket" text and see if it matches, and then I'm back in the "spend 22 more seconds to get 8 seconds of validation" loop. Times 4, because you have 4 of these examples listed.

Give up and use tables: https://twitter.com/tectonic/status/552241947604054016

jimmySixDOF, about 1 year ago
I like this approach: it treats the problem like a reranker. It would be interesting to BYO classifiers; you could use embeddings to match against the hierarchy, which seems like a good idea. Also, the code for RAPTOR came out recently, and they are showing gains through a tree/hierarchy too.

https://arxiv.org/html/2401.18059v1

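A minimal sketch of the embedding-to-hierarchy matching suggested above. Everything here is an assumption for illustration: the `TAXONOMY` tree, its descriptions, and the `embed` function, which is a bag-of-words stand-in for a real embedding model. It is not how the product or RAPTOR works, only one way a top-down tree match could look.

```python
import math
import re
from collections import Counter

# Hypothetical two-level taxonomy; each node carries a short description.
TAXONOMY = {
    "Technology": {
        "description": "software computers programming cloud api",
        "children": {
            "Cloud Computing": {
                "description": "cloud servers distributed hosting api",
                "children": {},
            },
            "Programming": {
                "description": "code programming language compiler software",
                "children": {},
            },
        },
    },
    "Finance": {
        "description": "money investing stocks banking",
        "children": {
            "Investing": {
                "description": "stocks bonds investing portfolio",
                "children": {},
            },
        },
    },
}


def embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, nodes):
    """Walk the tree top-down, keeping the best-scoring child at each level.

    Ties (including all-zero scores) fall to the first child listed.
    """
    query = embed(text)
    path = []
    while nodes:
        name, node = max(
            nodes.items(),
            key=lambda kv: cosine(query, embed(kv[1]["description"])),
        )
        path.append(name)
        nodes = node["children"]
    return "/" + "/".join(path)


print(classify("How do I deploy my API on cloud servers?", TAXONOMY))
```

Descending level by level keeps the number of comparisons small, at the cost that a wrong choice near the root cannot be corrected further down.
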
aleksiy123, about 1 year ago
I feel like it would be better to display the speed in ms instead of seconds.

30-50ms seems easier to read than 0.03s - 0.05s, especially since you say milliseconds in one of the taglines.

Also curious how no rate limiting works. Feels like a promise waiting to be broken.

Otherwise, I really like that you have a nice hobby plan. Will need to find a place to try it out.

Custom topics is the killer feature, though, I feel.

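The unit change suggested above is a one-line conversion. A small sketch, where `format_latency` is a hypothetical helper name and the input is assumed to be a duration in seconds:

```python
def format_latency(seconds: float) -> str:
    """Render a duration given in seconds as whole milliseconds."""
    return f"{round(seconds * 1000)} ms"


print(format_latency(0.03), "-", format_latency(0.05))
```
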
throwaway81523, about 1 year ago
Any info about how it works? I'm more interested in the technology than in using the product. Is there a query limit for the free API, though? I do have a couple of ideas.

Terretta, about 1 year ago
TL;DR: The API needs user-defined taxonomies and better data confidentiality. Intent classification is a hit.

---

This is fantastic, but as you acknowledge with the 'reach out to us' on your launch page, people are going to need custom topic taxonomies. We use several custom ones, maintained as YAML that non-technical users can edit.

I'm guessing, from having been looking for a project like yours *for a decade now*, that it's the custom taxonomy problem that means most OOTB classifiers don't work for people, so they make their own, which they don't open source because they ended up ... tailoring ... a topic text classifier for themselves.

The only thing I've found close to this "OOTB" is:

https://cloud.google.com/natural-language/docs/classifying-text

https://cloud.google.com/natural-language/docs/categories#categories_version_2

And, to be frank, I can't see why I'd send my confidential information to you when I can send it to Google. (Ahem!)

But the problem with theirs and yours is that the OOTB categories are for a global topic set, something like Yahoo directory, rather than for a given discipline. And what's generally needed is a set of disciplines, or several topic trees. (Think Amazon.com instead of Yahoo.)

I've found the general lists, like LCC[^1] (what you really want is LCSH[^2] subject headings, not LCC), too broad for my business or personal content, while something like ACM[^3] is more what's needed for, say, computing-related content.

For a firmwide knowledge base at a {field}-tech firm, you have a mix of the firm's focus field, and computing, and a broad-scope fallback like you're starting with. Even libraries have their own topic hierarchy! [^4]. Plenty of fields have controlled vocabularies[^6], and if you can't find one for a field, you can usually generate one by finding someone who is already classifying that field, and looking at their TOC. All of which is to say, to be generally useful, you have to let people BYOT (bring your own topics) for this.

For instance, we built our topic list by combining a reference taxonomy for our field, a reference taxonomy for computing, a reference taxonomy for business books, and the Google NLP tool mentioned above.

There are occasional tools that try to match arbitrary documents to arbitrary hierarchies, such as clerk [^5], but they are challenging for various reasons.

You have a note to contact you for different topics, but I'm raising this here since so far (6 hours) you've had no feedback, and I'm a big fan of what you're doing and the niche is underserved.

A couple other thoughts:

Aside from a topics taxonomy or hierarchy, we've recently found that something like property-based classification becomes necessary once the knowledge base holds 10K+ to 100K+ short- and long-form documents. For instance, colon classification (https://en.wikipedia.org/wiki/Colon_classification) adds "facets" such as a time dimension. This is incredibly helpful for relevance while still being able to drill in and just browse a topics/branch/leaf.

I really like your "intent" classification, far more interesting than sentiment, since it could help separate blog posts from news articles, self-guided tutorials from reviews, and so on: Problem Solving, News, Informational, maybe? Sifting these to focus a robust KB can be tremendously valuable.

Your privacy policy is by and large useless, since the information being classified is unlikely to be personal (PII) and more likely to be confidential or non-public (NPI).

You are, effectively, saying "let us have a copy of all info you're classifying", yet nowhere on your main site nor docs site do you explain how you actively prevent yourselves from seeing an API user's information.

Ideally your "architecture" would explain how you built it to be able to do the work *without* you being able to see the content, not just a "pinky swear we won't look" sort of promise. Many businesses have their own confidentiality and privacy policies. Those require looping in subprocessors, which you would be, and right now you can't be used.

Your API is on the right track with the types of classification already, especially its intent classification; it's a feature that users will find useful. But don't overlook the customization of topic classifications; it's what many are seeking and not finding elsewhere. The real concern, though, is confidentiality. You are asking firms to trust you with their content. Without that trust, even the best features can't win firms over because they don't have a choice but to control their data. Make sure firms understand exactly how you're keeping their data secure. Get this right, and you'll have a product that stands out for all the right reasons.

---

[^1]: https://en.wikipedia.org/wiki/Library_of_Congress_Classification

[^2]: https://id.loc.gov/authorities/subjects.html

[^3]: https://en.wikipedia.org/wiki/ACM_Computing_Classification_System

[^4]: https://www.ala.org/tools/topics/atoz

[^5]: https://github.com/blankenshipz/clerk/tree/main

[^6]: https://pitt.libguides.com/metadatadiscovery/controlledvocabularies

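A minimal sketch of the "bring your own topics" idea above: several topic trees combined into one flat list of paths that a classifier could be pointed at. The trees are hypothetical, and they are written as nested dicts, which is roughly the shape a YAML file would have once loaded; no YAML library is used here, and none of this reflects the product's actual API.

```python
def flatten(tree, prefix=""):
    """Yield every node of a nested-dict taxonomy as a '/'-separated path."""
    for name, children in tree.items():
        path = f"{prefix}/{name}"
        yield path
        # A leaf is written as None (an empty YAML value would load that way).
        yield from flatten(children or {}, path)


def merge(*trees):
    """Combine several topic trees into one sorted, de-duplicated path list."""
    return sorted({path for tree in trees for path in flatten(tree)})


# Hypothetical trees: the firm's focus field, computing, and a broad fallback.
FIELD = {"Medicine": {"Cardiology": None, "Oncology": None}}
COMPUTING = {"Computing": {"Networks": {"Cloud Computing": None}}}
FALLBACK = {"Business": None, "Computing": None}

topics = merge(FIELD, COMPUTING, FALLBACK)
for topic in topics:
    print(topic)
```

Keeping each source taxonomy as its own tree means non-technical editors can maintain them separately, and overlapping branches (here `/Computing`) collapse into a single entry at merge time.
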
yantrams, about 1 year ago
Congrats on the launch. This is something I'd spent some time on a few years ago. I hacked together something similar for my use case by reverse engineering. No ML model though: it uses nearest neighbours and Tversky similarity measures in Julia, with the same taxonomy that you are using.

Tested with one of the comments from this thread.

<pre><code>import requests

requests.post(
    "https://x2vud9xfq0.execute-api.ap-south-1.amazonaws.com/api/text/classify",
    json={
        "text": """
And, to be frank, I can't see why I'd send my confidential information to you when I can send it to Google. (Ahem!)

But the problem with theirs and yours is the OOTB categories are for a global topic set, something like Yahoo directory, rather than for a given discipline. And what's generally needed is a set of disciplines, or several topic trees. (Think Amazon.com instead of Yahoo.)

I've found the general lists, like LCM[^1] (what you really want is LCSH[^2] subject headings, not LCM), too broad for my business or personal content, while something like ACM[^3] is more what's needed for, say, computing related content.

For a firmwide knowledge base at a {field}-tech firm, you have a mix of the firm's focus field, and computing, and a broad scope fallback like you're starting with. Even libraries have their own topic hierarchy! [^4]. Plenty fields have controlled vocabularies[^6], and if you can't find one for a field, you can usually generate one by finding someone who is already classifying that field, and looking at their TOC. All of which is to say, to be generally useful, you have to let people BYOT (bring your own topics) for this.

For instance, we built our topic list based on combining a reference taxonomy for our field, a reference taxonomy for computing, a reference taxonomy for business books, and the Google NLP tool mentioned above.

There are occasional tools that try to match arbitrary documents to arbitrary hierarchies such as clerk [^5] but they are challenging for various reasons.

You have a note to contact you for different topics, but raising this here since so far (6 hours) you had no feedback, and I'm a big fan of what you're doing and the niche is underserved.

A couple other thoughts:
""",
        'key': 'HACKERNEWS'
    }
).json()

# Response:
{
    'genres': {'Technology': 24, 'Finance': 16, 'Education': 11},
    'tags': {
        '/Business & Industrial/Small Business/MLM & Business Opportunities': 5.094265117745211,
        '/Internet & Telecom/Web Services': 5.51434499612552,
        '/Finance/Investing': 5.72584536853734,
        '/Business & Industrial/Business Operations': 5.888633926463297,
        '/Jobs & Education/Education/Standardized & Admissions Tests': 6.0132143106028435,
        '/Business & Industrial/Business Services': 6.100261915913882,
        '/Jobs & Education/Jobs': 6.126547614437338,
        '/Science/Earth Sciences/Atmospheric Science': 6.1553064528175545,
        '/Finance': 6.249046550441405,
        '/Business & Industrial': 6.333431648078183
    },
    'id': '65f891a111ec14ddd4b56bda'
}</code></pre>

Your result

<pre><code>{
  "result": [
    ["/Arts & Entertainment/Books & Literature/Reference", 0.138976],
    ["/Jobs & Education/Job Listings", 0.138976],
    ["/Computers & Technology/Networking/Distributed & Cloud Computing", 0.069488],
    ["/Jobs & Education/Online Learning", 0.069488],
    ["/Arts & Entertainment/Music & Audio/Music Reference", 0.046325]
  ]
}</code></pre>
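For readers unfamiliar with the technique named above, here is a rough Python sketch of nearest-neighbour lookup using the Tversky index over token sets. It is not the commenter's Julia code: the `EXAMPLES` list, the tokenizer, and the default `alpha`/`beta` weights are all assumptions made for illustration. The index itself is the standard definition, |X∩Y| / (|X∩Y| + α|X−Y| + β|Y−X|).

```python
import re


def tokens(text):
    """Lower-case word set; a deliberately simple, assumed tokenizer."""
    return set(re.findall(r"[a-z]+", text.lower()))


def tversky(x, y, alpha=0.5, beta=0.5):
    """Tversky index between two sets.

    alpha = beta = 1 gives Jaccard similarity; alpha = beta = 0.5 gives Dice.
    """
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    return common / denom if denom else 0.0


# Hypothetical labelled examples, one per taxonomy path.
EXAMPLES = [
    ("/Finance/Investing", "stocks bonds portfolio dividends investing"),
    ("/Internet & Telecom/Web Services", "api hosting web service cloud endpoint"),
    ("/Jobs & Education/Jobs", "resume hiring interview salary job"),
]


def nearest(text, k=2):
    """Return the k labelled examples most similar to the text, best first."""
    query = tokens(text)
    scored = [(tversky(query, tokens(sample)), label) for label, sample in EXAMPLES]
    return sorted(scored, reverse=True)[:k]


for score, label in nearest("Which cloud API endpoint should I call?"):
    print(f"{score:.3f}  {label}")
```

Unequal `alpha` and `beta` make the measure asymmetric, which can help when a short query is compared against much longer reference texts.
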