Show HN: I made a dataset for finetuning embedding models

1 point by mihaich, about 1 year ago
I made an STSB alternative, but with dialog/assistant samples.

I couldn't find anything similar online (!), so I built it.

I did it because I needed a very small model that would work well with my React component, and none of the existing 17M models performed adequately.

The one I created with this dataset does.

Embedding models, like other types of models, can be task-specific, and there was no officially recognized task that matched my needs.

The closest is the "sentence similarity" task, but one of the most recognized benchmarks for it is STSB, and I find STSB quite strange.

Here is an example from STSB scored 5 out of 5: "A person cuts an onion." and "A person is cutting an onion."

Here is an example from STSB scored 1 out of 5: "A man is playing the flute" and "A man is playing the guitar".

STSB isn't what I need for my "real world" task. What I need is a way to find the best paragraphs that answer the question the user asks. This is why I made the dataset and why I fine-tuned an embedding model. It was a fun experience and the model works really well! :)
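For readers who want a picture of the retrieval task described above, here is a minimal sketch of ranking paragraphs against a user question by cosine similarity. It is not the fine-tuned model or the dataset from this post: `embed` is a hypothetical toy stand-in (a bag-of-words count vector) used only so the example runs without any dependencies, and `cosine` and `rank_paragraphs` are illustrative names. In practice `embed` would be replaced by a call to an actual embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption, not the post's model).

    Returns a sparse bag-of-words count vector keyed by lowercase token.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_paragraphs(question, paragraphs):
    """Return (score, paragraph) pairs, best match first."""
    q = embed(question)
    scored = [(cosine(q, embed(p)), p) for p in paragraphs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = [
        "To reset your password, open settings and choose reset password.",
        "Our office is closed on public holidays.",
        "Shipping takes three to five days.",
    ]
    for score, paragraph in rank_paragraphs("How do I reset my password?", docs):
        print(f"{score:.2f}  {paragraph}")
```

The ranking step stays the same whatever produces the vectors. The difference a task-specific model makes is in how well the vectors place a question near the paragraph that answers it, which word overlap alone cannot capture.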

No comments yet.