Ask HN: Is widespread use of non-commercial datasets an open secret in startups?

3 points by lamename 5 months ago
I'm asking about areas outside of language models. Setting foundation models aside, I still see small companies with very specific goals, and even niche offshoots at bigger companies, that need data sets for performant, custom models. Collecting that data can be a hindrance, yet some companies succeed anyway with no visible data collection effort. This is true for language, vision, etc.

I suspect that many of these are bootstrapped with pretrained models, many of which, surprisingly, carry non-commercial licenses or were trained on non-commercially licensed data sets.

So is it an open secret that companies just suck up whatever they can get their hands on anyway? Perhaps because the legal landscape is still so grey?

1 comment

talldayo 5 months ago
Maybe for demos? If you're dumb enough to ingest unlicensed content for your commercial application then you deserve all the flak that's coming at you. I doubt any startups that are serious about scaling will deliberately use illegal data when you can synthesize or curate a better and more legal dataset at relatively low cost.