I'm asking about models other than LLMs. Aside from foundation models, I still see small companies with very specific goals, and even niche offshoots at bigger companies. You still need datasets to build performant, custom models. Collecting that data can be a real hindrance, yet some companies succeed anyway, with no visible data collection effort. This is true for language, vision, etc.

I suspect that many of these are bootstrapped with pretrained models, many of which surprisingly have non-commercial licenses or were trained on non-commercially licensed datasets.

So is it an open secret that companies just suck up whatever data they can get their hands on? Or is the legal landscape simply still too grey for anyone to worry?
Maybe for demos? If you're dumb enough to ingest unlicensed content for your commercial application, then you deserve all the flak coming your way. I doubt any startup that's serious about scaling will deliberately use illegally sourced data when it can synthesize or curate a better, legally clean dataset at relatively low cost.