I used to do a lot of work with spam filtering. I once worked at a company that had set up hundreds of marketing websites, all of which were the start of a sales funnel that fed into Marketo, then into LeadSpace, and then into Salesforce. A good response from potential customers looked like this:<p>"I am interested in the pricing for MaxaMegaAI. Do you have a free tier for a startup with fewer than 10 developers?"<p>or:<p>"Can your ETL tool handle different systems for geospatial calculations?"<p>Bad responses looked like:<p>"None"<p>or:<p>"Damn"<p>or:<p>"sdefedflkjlkjsdfsdlkfjlskdfj"<p>I wrote simple machine learning scripts to automate some of our spam filtering.<p>I have the impression this area has come a long way since then. I think this category of machine learning is sometimes called "essay scoring."<p>I've been away from this kind of work for 7 years. I assume that nowadays, with LLMs, there are advanced techniques that can be easily implemented?<p>Can someone point me towards a good resource?
I process text through<p><a href="https://www.sbert.net/" rel="nofollow">https://www.sbert.net/</a><p>and apply a classical machine learning algorithm such as a probability-calibrated SVM. This usually beats bag-of-words classifiers because it is able to suss out some of the meaning of the words. The advantage of this approach is that it is very fast (maybe 30 seconds to reliably train a model).<p>It is also possible to "fine tune" a BERT-family model using tools from Huggingface, like so:<p><a href="https://huggingface.co/docs/transformers/training" rel="nofollow">https://huggingface.co/docs/transformers/training</a><p>My experience is that this takes more like 30 minutes to train a model, and the process is not as reliable. For some tasks it performs better than the first approach, but I haven't gotten it to reliably improve on my current models for my tasks.<p>I am planning to fine-tune a T5 model when I have a problem that I think it will do well on.
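A minimal sketch of the first approach, the probability-calibrated SVM over sentence embeddings, using scikit-learn. The real embedding step (shown in a comment; the model name "all-MiniLM-L6-v2" is just one common choice, not something the comment above specifies) is replaced here with toy 2-D vectors so the sketch runs without downloading a model:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# In the real pipeline the feature vectors come from sentence-transformers, e.g.:
#   from sentence_transformers import SentenceTransformer
#   X = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
# Here, toy 2-D vectors stand in for the embeddings: imagine axis 0
# loosely encoding "reads like a real product inquiry".

rng = np.random.default_rng(0)
n = 40
X_good = rng.normal(loc=[1.0, 0.0], scale=0.3, size=(n, 2))   # genuine leads
X_spam = rng.normal(loc=[-1.0, 0.0], scale=0.3, size=(n, 2))  # junk responses
X = np.vstack([X_good, X_spam])
y = np.array([1] * n + [0] * n)  # 1 = genuine inquiry, 0 = spam

# Wrap a linear SVM in CalibratedClassifierCV: the SVM alone only gives
# decision scores; the calibration layer turns them into probabilities.
clf = CalibratedClassifierCV(LinearSVC(), cv=5)
clf.fit(X, y)

# predict_proba is available because of the calibration wrapper
proba_good = clf.predict_proba([[1.1, 0.1]])[0, 1]
proba_spam = clf.predict_proba([[-1.0, -0.1]])[0, 1]
print(proba_good, proba_spam)
```

Training the SVM on precomputed embeddings is what makes the full cycle so fast: the embedding model is frozen, so only the small linear classifier is fit, and the calibrated probabilities let you set a spam threshold instead of taking a hard yes/no label.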