For those interested in related/alternative approaches, one or more of the following established open-source libraries might appeal to you:

- Snorkel (training data curation, weak supervision, heuristic labeling functions, uncertainty sampling, relation extraction): https://github.com/snorkel-team/snorkel

- AllenNLP (many pretrained NLP research models for tasks beyond text classification, model training and serving, visualization/interpretability utilities): https://github.com/allenai/allennlp

- spaCy (tokenization, NER/POS tagging plus a tagging visualizer, pretrained word vectors, integration with DL models): https://github.com/explosion/spaCy

- Hugging Face Transformers (the latest and greatest pretrained models, e.g. BERT): https://github.com/huggingface/transformers

- ...or a barebones “from scratch” solution in less than an hour with a Colab notebook and scikit-learn (preprocess text into tf-idf vectors, use LSA/NMF to generate “document embeddings”, visualize the embeddings with t-SNE/UMAP [which facilitates weak supervision/active learning], classify with LogReg/RF/SVM/whatever; see the sketch at the end of this comment). You could also tack on pretrained gensim/TF/PyTorch models quite easily as a next step, but this basic flow quickly gives you a handle on your corpus.

By the way, the docs for DeepDive (the predecessor of Snorkel) are some amazingly detailed background reading: http://deepdive.stanford.edu/example-spouse
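
A minimal sketch of that barebones flow (tf-idf -> LSA “document embeddings” -> linear classifier). The 20 newsgroups dataset, category choices, and hyperparameters here are just illustrative assumptions; swap in your own corpus, an NMF step, or a different classifier as you like:

    # tf-idf -> LSA embeddings -> logistic regression baseline
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-in corpus; replace with your own texts/labels.
    train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
    test = fetch_20newsgroups(subset="test", categories=["sci.space", "rec.autos"])

    clf = make_pipeline(
        TfidfVectorizer(max_features=50000, ngram_range=(1, 2)),  # preprocess text into tf-idf vectors
        TruncatedSVD(n_components=100),                           # LSA "document embeddings"
        LogisticRegression(max_iter=1000),                        # or RF/SVM/whatever
    )
    clf.fit(train.data, train.target)
    print("test accuracy:", clf.score(test.data, test.target))

The same TruncatedSVD output can also be fed to t-SNE/UMAP if you want to eyeball clusters before deciding how to label.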