I am looking for software to classify documents into 10-20 categories.
The documents are about half a screen to a screen long.

There is some labeled data (about 50-80 labeled documents per category, not 500 per category), so few-shot learning might be an option.

Algorithms: it could be something like k-nearest neighbors or some ML/neural network (transformers? LLMs?).
It just needs to do the classification properly.

Some restrictions:
It should be a "ready to use" pipeline with documentation about training the model, parameter optimization etc.
If possible, there should be some way to use the framework/library without Python (I'm not a Python developer).
For example, [1] and [2] allow using a command-line interface for everything; it seems Python is optional for these frameworks.
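For illustration, the supervised workflow from [1] is entirely command-line driven. A minimal sketch (file names like docs.train/docs.valid are placeholders; the training file has one document per line, prefixed with __label__<category>):

    $ ./fasttext supervised -input docs.train -output doc_model -epoch 25 -wordNgrams 2
    $ ./fasttext test doc_model.bin docs.valid
    $ ./fasttext predict doc_model.bin docs.test

fastText can also tune hyperparameters against a validation file with -autotune-validation docs.valid, which covers the "parameter optimization" requirement without any Python.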
The SetFit framework (see [3] and [4]) looks quite promising (good results with 8 labeled samples per class!), but it requires doing everything in Python (a short sketch follows the references below).

[1] https://fasttext.cc/docs/en/supervised-tutorial.html

[2] https://neuml.github.io/txtai/pipeline/text/labels/

[3] https://github.com/huggingface/setfit

[4] https://www.philschmid.de/getting-started-setfit
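For reference, the few-shot training loop described in [3]/[4] is only a few lines, but it is all Python. A rough sketch (dataset contents and the base model are placeholders; newer setfit releases have reorganized this API around Trainer/TrainingArguments):

    from datasets import Dataset
    from setfit import SetFitModel, SetFitTrainer

    # A few labeled examples per class; integer ids stand in for the 10-20 categories
    train_ds = Dataset.from_dict({
        "text": ["first labeled document ...", "second labeled document ..."],
        "label": [0, 1],
    })

    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
    trainer = SetFitTrainer(model=model, train_dataset=train_ds, num_iterations=20)
    trainer.train()

    print(model(["a new, unlabeled document"]))  # predicted class ids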
SetFit is a great framework for building a text classifier.

This is a pretty straightforward problem and a good fit for a standard text classifier as well.

Here is an example of fine-tuning a model with txtai: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/16_Train_a_text_labeler.ipynb
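Roughly, the notebook boils down to something like the sketch below (not the exact notebook code; the base model and the text/label field names are assumptions, so check the notebook for the details):

    from txtai.pipeline import HFTrainer, Labels

    # Labeled records; 50-80 documents per category as described above
    train = [
        {"text": "first labeled document ...", "label": 0},
        {"text": "second labeled document ...", "label": 1},
    ]

    # Fine-tune a small Hugging Face model as a text classifier
    trainer = HFTrainer()
    model, tokenizer = trainer("google/bert_uncased_L-2_H-128_A-2", train)

    # dynamic=False runs the model as a standard (non zero-shot) classifier
    labels = Labels((model, tokenizer), dynamic=False)
    print(labels("a new, unlabeled document"))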