This point got mocked when I raised it some time ago:<p><a href="https://news.ycombinator.com/item?id=42561419">https://news.ycombinator.com/item?id=42561419</a><p>DeepSeek promptly fixed it so that their UI responds with 'Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation.' - but only if you ask that as the first question of the conversation. Bury the 'what model are you' question after a few unrelated questions, and it'll happily tell you it's ChatGPT.<p>Is it possible that it's an actual distillation of weights, but into a radically different architecture? We don't have evidence of that, but it would be a great technical feat in itself.<p>Is it trained on a large set of user requests and OpenAI replies? Yes.<p>The question is, were these obtained by simply using the API contrary to the user agreement at scale, or was there access to internal OpenAI datasets, or was there some kind of capture of conversations by a man-in-the-middle (which could be any of a number of AI access resellers)?<p>The answer hinges on which _requests_ were in that training set, something that won't be easy to investigate - unless you're OpenAI itself and can identify 'trap streets' in the archive of all conversations: cases where ChatGPT once gave an unusual response to an unusual request, and DeepSeek just happens to match it.
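For illustration, a minimal sketch of what that 'trap street' check could look like, assuming an archive of (unusual prompt, archived ChatGPT response) pairs and an OpenAI-compatible endpoint for the suspect model; the base URL, model name, and similarity threshold here are assumptions for the sketch, not anything OpenAI has confirmed doing:

```python
from difflib import SequenceMatcher
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint for the suspect model.
client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

def probe(canaries: list[tuple[str, str]], threshold: float = 0.8) -> list[tuple[str, str]]:
    """canaries: (unusual prompt, archived ChatGPT response) pairs."""
    hits = []
    for prompt, archived in canaries:
        reply = client.chat.completions.create(
            model="deepseek-chat",  # assumed model name
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Flag replies that closely reproduce the archived oddity.
        if SequenceMatcher(None, reply, archived).ratio() >= threshold:
            hits.append((prompt, reply))
    return hits
```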
Microsoft paid $14 billion so they would have exclusive hosting access to those OpenAI models. Too bad that a free and open-weight model appeared online that matches the performance of what they paid $14 billion for.
> Microsoft’s security researchers in the fall observed individuals they believe may be linked to DeepSeek exfiltrating a large amount of data using the OpenAI application programming interface, or API, said the people, who asked not to be identified because the matter is confidential.<p>> Such activity could violate OpenAI’s terms of service or could indicate the group acted to remove OpenAI’s restrictions on how much data they could obtain<p>What do we think this means in practice?<p>"Exfiltrating data" makes it sound like they were taking private chat logs, but I imagine that would be a much bigger deal. I'm assuming it's just using multiple free OpenAI accounts across a bunch of different IP addresses to generate a large training set.
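For reference, "using the API to generate a large training set" at its simplest is just collecting prompt/response pairs; a hedged sketch using the official Python client (the model name and output format are illustrative, and doing this at the alleged scale would additionally require the account and IP rotation described above):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def collect(prompts: list[str], out_path: str = "train.jsonl") -> None:
    """Append one {"prompt", "response"} JSON object per line."""
    with open(out_path, "a") as f:
        for prompt in prompts:
            resp = client.chat.completions.create(
                model="gpt-4o",  # illustrative model choice
                messages=[{"role": "user", "content": prompt}],
            )
            f.write(json.dumps({
                "prompt": prompt,
                "response": resp.choices[0].message.content,
            }) + "\n")
```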
The world's smallest violin playing for OpenAI...<p>No sympathy from me. If you use copyrighted material to build your empire, you don't get to turn around and complain when somebody else does the same (even if they are Chinese).
Output of machines is not a creative expression and therefore not copyrightable. At worst, the use of ChatGPT for generating training material is against their terms of service, so is there any recourse besides banning the accounts used for this? (I actually don't know: has there ever been a CFAA prosecution for acting outside a ToS?)
There's a reason the top labs aren't releasing their frontier models anymore, and instead keep them in-house and use them to fine-tune smaller models. Because it works! It's the same reason o1 doesn't give you the "thinking" steps. Distillation works. It gets you ~80% of the way, as evidenced by the Qwen/Llama distillations of R1.<p>The "walls" aren't what they appear to be.
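For the curious, the classic recipe here is soft-label distillation (Hinton et al., 2015): train the student against the teacher's softened output distribution. A generic PyTorch sketch, not any lab's actual pipeline; note that the Qwen/Llama R1 distillations were reportedly plain supervised fine-tuning on generated outputs, which is the black-box special case where only the hard-label term on teacher-generated data remains:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 labels: torch.Tensor,
                 T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    # KL divergence between the softened teacher and student distributions;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary cross-entropy on hard labels (ground truth or teacher samples).
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```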
What's the problem? At most this might be a ToS violation, but it also seems easy to avoid (if you care at all). DeepSeek does not even have to be a customer of OpenAI, and thus may not be subject to their ToS at all.<p>Company A pays OpenAI for their API. They use the API to generate or augment a lot of data. They own the data. They post the data on the open Internet.<p>Company B has the habit of scraping various pages on the Internet to train its large language models, which includes the data posted by Company A. [1]<p>OpenAI is undoubtedly breaking many terms of service and licenses when it uses most of the open Internet to train its models. Not to mention potential copyright violations (which do not apply to AI outputs).<p>[1]: This is not hypothetical BTW. In the early days of LLMs, lots of large labs accidentally and not so accidentally trained on the now famous ShareGPT dataset (outputs from ChatGPT shared on the ShareGPT website).
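ShareGPT-style dumps typically store each conversation as a list of {"from", "value"} turns; a sketch of how such scraped data gets ingested into prompt/completion pairs (the field names follow the common community convention and may differ between dumps):

```python
import json

def load_sharegpt(path: str) -> list[dict]:
    """Pair consecutive human/gpt turns into prompt/completion examples."""
    with open(path) as f:
        data = json.load(f)
    examples = []
    for record in data:
        turns = record.get("conversations", [])
        for user, assistant in zip(turns[::2], turns[1::2]):
            if user.get("from") == "human" and assistant.get("from") == "gpt":
                examples.append({"prompt": user["value"],
                                 "completion": assistant["value"]})
    return examples
```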
If I were OpenAI, I would start worrying about my public image.<p>The reason Microsoft could get away with being horrible for many years was that they had a moat.
Isn’t OpenAI blocking China?<p><a href="https://www.theguardian.com/world/article/2024/jul/09/chinese-developers-openai-blocks-access-in-china-artificial-intelligence" rel="nofollow">https://www.theguardian.com/world/article/2024/jul/09/chines...</a><p>Chinese developers scramble as OpenAI blocks access in China<p>In other words, how could DeepSeek, a Chinese company, have entered into terms of service with OpenAI?
When looking at this I became suspicious because DeepSeek has/had DNS records referencing OpenAI and Copilot. Then I got their chat to tell me what their model is based on, and it said ChatGPT. Hope my own Bluesky account is okay for screenshots:<p><a href="https://bsky.app/profile/rosshosman.bsky.social/post/3lgu4c5do622e" rel="nofollow">https://bsky.app/profile/rosshosman.bsky.social/post/3lgu4c5...</a>
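Anyone can check claims like this themselves by querying the domain's records; a sketch using dnspython (the domain and the substrings searched for are placeholders, and DNS records can change or disappear at any time, so absence now proves nothing about last month):

```python
import dns.exception
import dns.resolver  # pip install dnspython

def find_references(domain: str, needles=("openai", "copilot")) -> list[tuple[str, str]]:
    findings = []
    for rtype in ("CNAME", "TXT", "MX", "NS"):
        try:
            for rr in dns.resolver.resolve(domain, rtype):
                text = rr.to_text().lower()
                if any(n in text for n in needles):
                    findings.append((rtype, text))
        except dns.exception.DNSException:
            continue  # no record of this type, or the lookup failed
    return findings

print(find_references("deepseek.com"))  # placeholder domain
```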
The administration will probably block DeepSeek's access to the app stores, like they are doing with TikTok, for daring to be competitive with US companies.
It's all "you shouldn't have posted it online!" until someone does it to them, then it's all "whatever happened to honor among thieves?"
Who gives a hoot.<p>OpenAI lies, steals, and grifts, and the sooner the responsibility for or control of any important tech is taken away from them, the better.
The problem is not the "improperly obtained" part, but the "they cannot have built it on a $6M budget; there must be something else going on..." part.
What are they going to do? Sue DeepSeek in a court in Hangzhou, China? Try and get the model weights taken down from the internet? Good luck with either one...
What legal actions can OpenAI take if this is proven to be true? Can DeepSeek be banned in the US?<p>I am also curious whether they used any probing or watermarking of their models' outputs to detect this.
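On the watermarking question: one published approach (Kirchenbauer et al., 2023) biases generation toward a pseudorandom "green list" of tokens, which a simple z-test can later detect in a suspect corpus. A toy sketch of the detection side only; there is no public confirmation that OpenAI watermarks ChatGPT output this way:

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary on the "green list" per step

def is_green(prev_token: int, token: int) -> bool:
    # Pseudorandom green-list membership, seeded by the preceding token.
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return digest[0] < 256 * GAMMA

def watermark_z_score(token_ids: list[int]) -> float:
    # Under the null (no watermark), green hits ~ Binomial(T, GAMMA).
    T = len(token_ids) - 1
    greens = sum(is_green(a, b) for a, b in zip(token_ids, token_ids[1:]))
    return (greens - GAMMA * T) / math.sqrt(T * GAMMA * (1 - GAMMA))

# A z-score well above ~4 over long texts would indicate watermarked output.
```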
If they really believe that billions, trillions, the whole of the human future is on the line, it's going to get messy. American capitalists have excelled at ruthless competition by the sword (which is frequently the legal pen) since the days of pelts, gold, oil, etc.
Well, I can't wait for the day when Microsoft just disappears. They've spent their whole existence trying to stifle innovation and competition, and here we are again. This time they have essentially been scammed by OpenAI, thinking they could pull off their anti-competitive practices once more with exclusive access to their models, only to learn that they've lost, and now they're resorting to litigating their sorry ass out of the situation. All the while, the US government is living a crypto-wars déjà vu, trying to manufacture as much propaganda as possible to make us believe China is the new enemy we should be worried about this time.<p>Yep, nope, thanks. Keep those papers coming, bois. Make those models small enough that they can run locally so we don't depend on an online feudal lord.