科技回声

My question isn't limited to language models. Aside from foundation models, I still see small companies with very specific goals, as well as niche offshoots at bigger companies. One still needs data sets for performant, custom models. Collecting that data can be a hindrance, but some companies succeed anyway, with no visible data collection effort. This is true for language, vision, etc.

I suspect that many of these are bootstrapped with pretrained models, many of which, surprisingly, have non-commercial licenses or were trained on non-commercially licensed data sets.

So is it an open secret that companies just suck up whatever they can get their hands on anyway? Perhaps because the legal landscape is still so grey?

1 comment

talldayo, 5 months ago

Maybe for demos? If you're dumb enough to ingest unlicensed content for your commercial application then you deserve all the flak that's coming at you. I doubt any startups that are serious about scaling will deliberately use illegal data when you can synthesize or curate a better and more legal dataset at relatively low cost.

Ask HN: Is widespread use of non-commercial datasets an open secret in startups?