TL;DR: API needs user-defined taxonomies and better data confidentiality. Intent classification is a hit.<p>---<p>This is fantastic, but as you acknowledge with the 'reach out to us' on your launch page, people are going to need custom topic taxonomies. We use several custom ones, maintained as YAML that non-technical users can edit.<p>Having looked for a project like yours <i>for a decade now</i>, my guess is that the custom taxonomy problem is why most OOTB tools don't work for people: they build their own, and don't open source it because they ended up ... tailoring ... a topic text classifier for themselves.<p>The only thing I've found close to this "OOTB" is:<p><a href="https://cloud.google.com/natural-language/docs/classifying-text" rel="nofollow">https://cloud.google.com/natural-language/docs/classifying-t...</a><p><a href="https://cloud.google.com/natural-language/docs/categories#categories_version_2" rel="nofollow">https://cloud.google.com/natural-language/docs/categories#ca...</a><p>And, to be frank, I can't see why I'd send my confidential information to you when I can send it to Google. (Ahem!)<p>But the problem with theirs and yours is that the OOTB categories are for a global topic set, something like the Yahoo directory, rather than for a given discipline. What's generally needed is a set of disciplines, or several topic trees. (Think Amazon.com instead of Yahoo.)<p>I've found the general lists, like LCC[^1] (what you really want is LCSH[^2] subject headings, not LCC), too broad for my business or personal content, while something like ACM[^3] is closer to what's needed for, say, computing-related content.<p>For a firmwide knowledge base at a {field}-tech firm, you have a mix of the firm's focus field, computing, and a broad-scope fallback like the one you're starting with. Even libraries have their own topic hierarchy![^4] 
Plenty of fields have controlled vocabularies[^6], and if you can't find one for a field, you can usually generate one by finding someone who is already classifying that field and looking at their TOC. All of which is to say, to be generally useful, you have to let people BYOT (bring your own topics) for this.<p>For instance, we built our topic list by combining a reference taxonomy for our field, a reference taxonomy for computing, a reference taxonomy for business books, and the Google NLP tool mentioned above.<p>There are occasional tools that try to match arbitrary documents to arbitrary hierarchies, such as clerk[^5], but they are challenging for various reasons.<p>You have a note to contact you for different topics, but I'm raising this here since so far (6 hours in) you've had no feedback, and I'm a big fan of what you're doing and the niche is underserved.<p>A couple other thoughts:<p>Aside from a topic taxonomy or hierarchy, we've recently found that something like property-based classification becomes necessary once the knowledge base holds 10K+ to 100K+ short- and long-form documents. For instance, <a href="https://en.wikipedia.org/wiki/Colon_classification" rel="nofollow">https://en.wikipedia.org/wiki/Colon_classification</a>, which adds "facets" like a time dimension. This is incredibly helpful for relevance while still letting you drill in and just browse a topic/branch/leaf.<p>I really like your "intent" classification, far more interesting than sentiment, since it could help separate blog posts from news articles, self-guided tutorials from reviews, and so on: Problem Solving, News, Informational, maybe? 
Sifting these to focus a robust KB can be tremendously valuable.<p>Your privacy policy is by and large useless here, since the information being classified is unlikely to be personal (PII) and more likely to be confidential or non-public (NPI).<p>You are, effectively, saying "let us have a copy of all info you're classifying", yet nowhere on your main site or docs site do you explain how you actively prevent yourselves from seeing an API user's information.<p>Ideally your "architecture" would explain how you built it to be able to do the work <i>without</i> you being able to see the content, not just a "pinky swear we won't look" sort of promise. Many businesses have their own confidentiality and privacy policies. Those require looping in subprocessors, which includes you, and right now you can't be used.<p>Your API is on the right track with the types of classification already, especially its intent classification, a feature that users will find useful. But don't overlook the customization of topic classifications; it's what many are seeking and not finding elsewhere. The real concern, though, is confidentiality. You are asking firms to trust you with their content. Without that trust, even the best features can't win firms over, because they have no choice but to control their data. Make sure firms understand exactly how you're keeping their data secure. 
Get this right, and you'll have a product that stands out for all the right reasons.<p>---<p>[^1]: <a href="https://en.wikipedia.org/wiki/Library_of_Congress_Classification" rel="nofollow">https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...</a><p>[^2]: <a href="https://id.loc.gov/authorities/subjects.html" rel="nofollow">https://id.loc.gov/authorities/subjects.html</a><p>[^3]: <a href="https://en.wikipedia.org/wiki/ACM_Computing_Classification_System" rel="nofollow">https://en.wikipedia.org/wiki/ACM_Computing_Classification_S...</a><p>[^4]: <a href="https://www.ala.org/tools/topics/atoz" rel="nofollow">https://www.ala.org/tools/topics/atoz</a><p>[^5]: <a href="https://github.com/blankenshipz/clerk/tree/main">https://github.com/blankenshipz/clerk/tree/main</a><p>[^6]: <a href="https://pitt.libguides.com/metadatadiscovery/controlledvocabularies" rel="nofollow">https://pitt.libguides.com/metadatadiscovery/controlledvocab...</a>
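<p>---<p>A minimal sketch of the BYOT idea, for illustration only. It merges several topic trees into one namespaced list of topic paths, the way the combined topic list above draws on several reference taxonomies. Nested dicts stand in for what a YAML file of nested keys would load into. The function names (flatten, merge_taxonomies) and the sample topics are hypothetical; this is not the setup described above, nor anything the API under discussion exposes.

```python
from typing import Dict, List, Optional

# A topic tree is a nested mapping: topic name -> subtree (None for a leaf).
# This is an assumed shape, chosen because nested YAML keys load this way.
Tree = Dict[str, Optional["Tree"]]


def flatten(tree: Tree, prefix: str = "") -> List[str]:
    """Return every topic in the tree as a slash-separated path."""
    paths: List[str] = []
    for name, subtree in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        paths.append(path)
        if subtree:
            # Recurse into child topics, carrying the parent path along.
            paths.extend(flatten(subtree, path))
    return paths


def merge_taxonomies(sources: Dict[str, Tree]) -> List[str]:
    """Combine several taxonomies, namespacing each under its source name."""
    merged: List[str] = []
    for source, tree in sources.items():
        merged.extend(flatten(tree, source))
    return merged


# Hypothetical example data, not a real taxonomy.
sources = {
    "field": {"Contracts": {"Leases": None}},
    "computing": {"Software": {"Testing": None, "Databases": None}},
}
topics = merge_taxonomies(sources)
```

<p>Namespacing each tree under its source name keeps same-named topics from different taxonomies apart, so the merged list could be handed to a classifier as its label set without collisions.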