Ask HN: Is widespread use of non-commercial datasets an open secret in startups?

3 points by lamename 6 months ago
I'm asking outside of language models. Aside from foundational models, I still see small companies with very specific goals, and even niche offshoots at bigger companies. They still need data sets for performant, custom models. Collecting that data can be a hindrance, but some companies succeed anyway, with no apparent data collection efforts. This is true for language, vision, etc.

I suspect that many of these are bootstrapped with pretrained models, many of which, surprisingly, have non-commercial licenses or were trained on non-commercially licensed data sets.

So is it an open secret that companies just suck up whatever they can get their hands on anyway? Perhaps because the legal landscape is still so grey?

1 comment

talldayo 6 months ago
Maybe for demos? If you're dumb enough to ingest unlicensed content for your commercial application then you deserve all the flak that's coming at you. I doubt any startups that are serious about scaling will deliberately use illegal data when you can synthesize or curate a better and more legal dataset at relatively low cost.