Microsoft Probing If DeepSeek-Linked Group Improperly Obtained OpenAI Data

122 points by simonpure 4 months ago

38 comments

bushido 4 months ago

https://archive.ph/QouOV

tapoxi 4 months ago

Oh I see, so training on copyrighted content is fine unless it's your AI model...

waldrews 4 months ago

This point got mocked when I raised it some time ago: https://news.ycombinator.com/item?id=42561419

DeepSeek promptly fixed it so that their UI responds with 'Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation.' - but only if you ask that as the first question of the conversation. Bury the 'what model are you' question after a few unrelated questions, and it'll happily tell you it's ChatGPT.

Is it possible that it's an actual distillation of weights, but into a radically different architecture? We don't have evidence of that, but that would be a great technical feat in itself.

Is it trained on a large set of user requests and OpenAI replies? Yes.

The question is, were these obtained by simply using the API contrary to the user agreement at scale, or was there access to internal OpenAI datasets, or was there some kind of capture of conversations by a man-in-the-middle (which could be any of a number of AI access resellers)?

The answer hinges on which _requests_ were in that training set, something that won't be easy to investigate - unless you're OpenAI itself, and can identify 'trap streets' in the archive of all conversations, cases where ChatGPT once gave an unusual response to an unusual request, and DeepSeek just happens to match it.
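
A minimal sketch of what such a 'trap street' check might look like, purely as an illustration. Everything here is an assumption: the function names (`normalize`, `similarity`, `trap_street_hits`), the idea of holding the archive as a prompt-to-response dict, and the 0.9 threshold are all hypothetical, and nothing is known about how OpenAI or Microsoft actually investigate this. It only shows the core idea: replay unusual prompts and flag the ones where the other model's answer nearly matches the archived unusual answer.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def trap_street_hits(canaries, suspect_answers, threshold=0.9):
    """Return the prompts whose suspect answer closely matches the archived one.

    canaries: dict mapping an unusual prompt to the unusual response once given
              (a hypothetical stand-in for an archive of past conversations).
    suspect_answers: dict mapping the same prompts to the other model's responses.
    threshold: assumed cut-off; a real analysis would need to calibrate this.
    """
    hits = []
    for prompt, archived in canaries.items():
        answer = suspect_answers.get(prompt)
        if answer is not None and similarity(archived, answer) >= threshold:
            hits.append(prompt)
    return hits
```

A handful of matches would prove little on its own, since two models trained on the same public web text can agree by coincidence; the signal would come from matching many responses that are unusual enough to be unlikely anywhere else.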

ChuckMcM 4 months ago

Oh cry me a river. Read the room Microsoft, you can't have it both ways.

tw1984 4 months ago

Microsoft paid $14 billion so they have exclusive hosting access to those OpenAI models. Too bad that a free and open-weight model appeared online that matches the performance of what they paid $14 billion for.

muglug 4 months ago

> Microsoft’s security researchers in the fall observed individuals they believe may be linked to DeepSeek exfiltrating a large amount of data using the OpenAI application programming interface, or API, said the people, who asked not to be identified because the matter is confidential.

> Such activity could violate OpenAI’s terms of service or could indicate the group acted to remove OpenAI’s restrictions on how much data they could obtain.

What do we think this means in practice?

"Exfiltrating data" makes it sound like they were taking private chat logs, but I imagine that would be a much bigger deal. I'm assuming it's just using multiple free OpenAI accounts across a bunch of different IP addresses to generate a large training set.

_Algernon_ 4 months ago

The world's smallest violin playing for OpenAI...

No sympathy from me. If you use copyrighted material to build your empire, you don't get to turn around and complain when somebody else does the same (even if they are Chinese).

jazzyjackson 4 months ago

Output of machines is not a creative expression and therefore not copyrightable. At worst, the use of ChatGPT for generating training material is against their terms of service, so is there any recourse besides banning some account used for this? (I actually don't know: has there ever been a CFAA prosecution for acting outside of a ToS?)

benreesman 4 months ago

These guys must really be in some deep shit to pull a stunt like this.

Isn’t the earnings call tomorrow? Have fun with that.

NitpickLawyer 4 months ago

There's a reason the top labs aren't releasing their frontier models anymore, and instead keep them in-house and use them to fine-tune smaller models. Because it works! It's the same reason o1 doesn't give you the "thinking" steps. Distillation works. It gets you ~80% of the way, as evidenced by the Qwen/Llama distillations of R1.

The "walls" aren't what they appear to be.

neilv 4 months ago

I wonder whether bloomberg.com realized what a hilariously rage-baiting headline that is.

blibble 4 months ago

oh the irony

also, I bet the Chinese are quaking in their boots at the thought of an investigation by Microsoft

Palmik 4 months ago

What's the problem? At most this might be a ToS violation, but it also seems easy to avoid that (if you care at all). DeepSeek does not even have to be a customer of OpenAI and thus not subject to their ToS.

Company A pays OpenAI for their API. They use the API to generate or augment a lot of data. They own the data. They post the data on the open Internet.

Company B has the habit of scraping various pages on the Internet to train its large language models, which includes the data posted by Company A. [1]

OpenAI is undoubtedly breaking many terms of service and licenses when it uses most of the open Internet to train its models. Not to mention potential copyright violations (which do not apply to AI outputs).

[1]: This is not hypothetical BTW. In the early days of LLMs, lots of large labs accidentally and not so accidentally trained on the now famous ShareGPT dataset (outputs from ChatGPT shared on the ShareGPT website).

vrighter 4 months ago

so it's ok if they ignore robots.txt and just vacuum up every scrap of data they can, but not when someone else does it to them, iiuc

dathinab 4 months ago

they probably want to create enough legal instability to prevent companies from using this model internally for their use cases

chvid 4 months ago

If I was OpenAI, I would start worrying about my public image.

The reason Microsoft could get away with being horrible for many years was that they had a moat.

not_your_vase 4 months ago

And, do what? China is famous for their deeply rooted respect for the (assumed and real) rights of foreign companies, right?

chvid 4 months ago

Isn’t OpenAI blocking China?

https://www.theguardian.com/world/article/2024/jul/09/chinese-developers-openai-blocks-access-in-china-artificial-intelligence

Chinese developers scramble as OpenAI blocks access in China

In other words, how could DeepSeek, a Chinese company, have entered into terms of service with OpenAI?

tjpnz 4 months ago

Weren't they employing the same strategy against Linux not so long ago?

neonnomad 4 months ago

When looking at this I became suspicious because DeepSeek has/had DNS records referencing OpenAI and co-pilot. Then I got their chat to tell me what their model is based on, and it said ChatGPT. Hope my own Bluesky account is okay for screenshots: https://bsky.app/profile/rosshosman.bsky.social/post/3lgu4c5do622e

hokkos 4 months ago

The administration will probably block DeepSeek's access to the app stores, like they are doing for TikTok, for daring to be competitive with US companies.

siliconc0w 4 months ago

I bet they'll reverse course and get Sacks to push for some KYC regs to prevent frontier models from providing China additional training data.

drysine 4 months ago

"As the leading builder of AI, we engage in countermeasures to protect our IP"

So now it is a problem? )

Kye 4 months ago

It's all "you shouldn't have posted it online!" until someone does it to them, then it's all "what ever happened to honor among thieves?"

courseofaction 4 months ago

Who gives a hoot.

OpenAI lies and steals and grifts and the sooner the responsibility for or control of any important tech is taken away from them the better.

blastonico 4 months ago

The problem is not the "improperly obtained", but the "they cannot have built it under a $6M budget, there should be something else going on..."

kittikitti 4 months ago

Sounds like Microsoft out-litigates instead of out-competes.

JSR_FDED 4 months ago

Didn’t Stanford take the first version of Meta’s Llama, then use OpenAI (against its TOS) to create Alpaca?

andrewstuart 4 months ago

Irony in the deepest sense of the word.

paxys 4 months ago

What are they going to do? Sue DeepSeek in a court in Hangzhou, China? Try and get the model weights taken down from the internet? Good luck with either one...

vinni2 4 months ago

What legal actions can OpenAI take if this is proven to be true? Can DeepSeek be banned in the US?

I am also curious if they used any probing or watermarking of their models to detect this.

jaikant 4 months ago

This sounds funny.

827a 4 months ago

Hey team, let's wait and see if this post gets flagged off the frontpage of HackerNews like every other anti-DeepSeek post in the past 72 hours.

picafrost 4 months ago

If they really believe billions, trillions, the whole of human future is on the line, it’s going to get messy. American capitalists have excelled at ruthless competition by the sword (which is frequently the legal pen) since the days of pelts, gold, oil, etc.

glouwbug 4 months ago

Couldn't you just... train off chatgpt's output?

juunpp 4 months ago

Well, I can't wait for the day when Microsoft just disappears. All their life trying to stifle innovation and competition, and here we are again, where this time they have essentially been scammed by OpenAI, thinking that they could pull off their anti-competitive practices once more with exclusive access to their models, only to then learn that they've lost and resort to litigating their sorry ass out of the situation, all the while the US government is living a crypto wars déjà vu, trying to manufacture as much propaganda as possible to make us believe China is the new enemy we should be worried about this time.

Yep, nope, thanks. Keep those papers coming, bois. Make those models small enough that they can run locally so we don't depend on an online feudal lord.

snickerbockers 4 months ago

Oh, now it matters.

edgineer 4 months ago

lol, lmao even
