
Ask HN: Where can I find the latest info for essay ranking, spam filtering?

2 points | by lkrubner | about 1 year ago
I used to do a lot of work with spam filtering. I once worked at a company that had set up hundreds of marketing websites, all of which were the start of a sales funnel that fed into Marketo, then into LeadSpace, and then into Salesforce. A good response from potential customers looked like this:

"I am interested in the pricing for MaxaMegaAI. Do you have a free tier for a startup with less than 10 developers?"

or:

"Can your ETL tool handle different systems for geospatial calculations?"

Bad responses looked like:

"None"

or:

"Damn"

or:

"sdefedflkjlkjsdfsdlkfjlskdfj"

I wrote simple machine learning scripts to automate some of our spam filtering.

I have the impression this has come a long way since then. I think this category of machine learning is sometimes called "essay ranking."

I've been away from this kind of work for 7 years. I assume nowadays, with LLMs, there might be some advanced techniques that can be easily implemented. Can someone point me towards a good resource?
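As context, here is a minimal sketch of the kind of simple spam-filtering script the poster describes, assuming scikit-learn; the training texts reuse the post's good/bad examples, and the labels and query string are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data modeled on the responses quoted in the post
responses = [
    "I am interested in the pricing for MaxaMegaAI. Do you have a free tier?",
    "Can your ETL tool handle different systems for geospatial calculations?",
    "What integrations do you offer with Salesforce?",
    "None",
    "Damn",
    "sdefedflkjlkjsdfsdlkfjlskdfj",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = genuine lead, 0 = spam/noise

# Bag-of-words features plus a linear classifier: fast to train and retrain
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(responses, labels)

print(model.predict(["Is there a discount for teams under 10 developers?"]))
```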

1 comment

PaulHoule | about 1 year ago
I process text through

https://www.sbert.net/

and apply a classical machine learning algorithm such as a probability-calibrated SVM. This usually beats bag-of-words classifiers, as it is able to suss out some of the meaning of words. The advantage of this approach is that it is very fast (maybe 30 seconds to reliably train a model).

It is also possible to "fine tune" a BERT-family model using tools from Huggingface, like so:

https://huggingface.co/docs/transformers/training

My experience is that this takes more like 30 minutes to train a model, but the process is not so reliable. For some tasks this performs better than the first approach, but I haven't gotten it to reliably improve on my current models for my tasks.

I am planning on fine-tuning a T5 model when I have a problem that I think it will do well on.
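A minimal sketch of the embed-then-classify approach this comment describes, assuming the sentence-transformers and scikit-learn packages; the model name, toy texts, and labels are placeholders, not details from the comment.

```python
from sentence_transformers import SentenceTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

texts = [
    "I am interested in pricing for your ETL tool. Is there a free tier?",
    "Can your product handle geospatial calculations?",
    "Do you integrate with Salesforce and Marketo?",
    "None",
    "Damn",
    "sdefedflkjlkjsdfsdlkfjlskdfj",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = genuine inquiry, 0 = spam/noise

# Encode each text into a fixed-size sentence embedding
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

# Wrap a linear SVM in probability calibration so it outputs calibrated scores
clf = CalibratedClassifierCV(LinearSVC(), cv=2)
clf.fit(X, labels)

print(clf.predict_proba(encoder.encode(["Do you offer a startup discount?"])))
```

The fine-tuning route mentioned afterwards instead updates the transformer's own weights on the labeled examples via the Huggingface training tools linked above, which matches the longer and less predictable training times the commenter reports.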