This post describes a research project I worked on that ended up needing domain-specific natural language processing (NLP).

The project started out with learning Bayesian networks. The nets were learned to estimate the probabilities of maintenance actions on machines. But once we explored the data, we realized extensive pre-processing was needed: most of the information about actions was in free-form descriptions rather than tabulated numerically.

So the choice was between using a large language model (LLM), remotely or locally, and a more rustic NLP approach. I chose the latter for four reasons:

1. Explainability. Smaller models are easier to decompose and analyze.

2. Security. Using LLMs would likely have required cloud-based or off-premise devices, which posed an additional security hurdle in our case.

3. Speed. This was a very domain-specific dataset with relatively few training instances. Fine-tuning LLMs on it would have taken time (data collection plus training).

4. Performance. The use case was not text generation but text retrieval for Bayesian reasoning. We could get away with NLP models that only needed to adequately infer similarity between texts (see the sketch after this list).
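As a rough illustration of what "inferring similarity between texts" can look like with a lightweight approach, here is a minimal sketch using TF-IDF vectors and cosine similarity from scikit-learn. The maintenance descriptions and the query below are made up for illustration; the post does not specify the exact method used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative free-form maintenance descriptions (not real project data)
descriptions = [
    "replaced worn bearing on conveyor motor",
    "lubricated conveyor motor bearing",
    "calibrated pressure sensor on feed pump",
]

query = "conveyor motor bearing replacement"

# Fit TF-IDF on the corpus, then project both corpus and query into that space
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(descriptions)
query_vector = vectorizer.transform([query])

# Cosine similarity between the query and each description, highest first
scores = cosine_similarity(query_vector, doc_vectors)[0]
for text, score in sorted(zip(descriptions, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {text}")
```

A more elaborate pipeline might swap in domain-specific tokenization or embeddings, but the overall retrieve-by-similarity structure stays the same.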