Show HN: I made a dataset for finetuning embedding models

1 point by mihaich about 1 year ago
I made an STSB alternative, but with dialog/assistant samples.

I couldn't find anything similar online (!), so I built it.

The reason I did it is that I needed a very small model that would work well with my React component, and none of the existing 17M models performed adequately.

The one I created with this dataset does.

Embedding models, like other types of models, can be task-specific, and there was no officially recognized task that matched my needs.

The closest is the "sentence similarity" task, but one of the most recognized benchmarks for it is STSB, and I find STSB quite strange.

Here is an example scored 5 out of 5 in STSB: "A person cuts an onion." and "A person is cutting an onion."

Here is an example scored 1 out of 5 in STSB: "A man is playing the flute" and "A man is playing the guitar".

STSB isn't what I need for my "real world" task. What I need is a way to find the paragraphs that best answer the question the user asks. This is why I made this dataset and why I fine-tuned an embedding model on it. It was a fun experience and the model works really well! :)
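
For readers who want to picture the retrieval task described above (finding the paragraph that best answers a question), here is a minimal sketch. It is not the code from the post: the function names are hypothetical, and the 3-dimensional vectors are made up for illustration. In a real setup the vectors would come from an embedding model, such as the fine-tuned one mentioned above. The sketch assumes paragraphs are ranked by cosine similarity to the question, which is a common choice but not something the post confirms.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_paragraphs(question_vec, paragraph_vecs):
    """Return (index, score) pairs sorted from best to worst match."""
    scored = [
        (i, cosine_similarity(question_vec, vec))
        for i, vec in enumerate(paragraph_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings, invented for this example only.
question = [0.9, 0.1, 0.0]
paragraphs = [
    [0.0, 1.0, 0.0],  # unrelated paragraph
    [0.8, 0.2, 0.1],  # paragraph that answers the question
    [0.1, 0.0, 1.0],  # another unrelated paragraph
]

ranking = rank_paragraphs(question, paragraphs)
best_index, best_score = ranking[0]
print(best_index)  # index of the best-matching paragraph
```

With these toy vectors the second paragraph (index 1) ranks first, because its vector points in nearly the same direction as the question's.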

no comments
