
If you can't reproduce the model then it's not open-source

230 points by mgreg over 1 year ago

22 comments

ssgodderidge over 1 year ago
> Imagine if Linux published only a binary without the codebase. Or published the codebase without the compiler used to make the binary. This is where we are today.

This was such a helpful way to frame the problem! Something felt off about the "open source models" out there; this highlights the problem incredibly well.
elashri over 1 year ago
I think the process of data acquisition isn't so clear-cut. Take CERN as an example: they release loads of data from various experiments under the CC0 license [1]. This isn't just a few small datasets for classroom use; we're talking big-league data, like the entire first-run data from LHCb [2].

On their portal, they don't just dump the data and leave you to it. They've got guides on analysis and the necessary tools (mostly open source stuff like ROOT [3] and even VMs). This means anyone can dive in. You could potentially discover something new or build on existing experiment analyses. This setup, with open data and tools, ticks the boxes for reproducibility. But does it mean people need to recreate the data themselves?

Ideally, yeah, but realistically, while you could theoretically rebuild the LHC (since most technical details are public), it would take an army of skilled people, billions of dollars, and years to do it.

This contrasts with open source models, where you can retrain models using data to get the weights. But getting hold of the data and the cost to reproduce the weights is usually prohibitive. I get that CERN's approach might seem to counter this, but remember, they're not releasing raw data (which is mostly noise), but a more refined version. Otherwise, try downloading several petabytes of raw data; good luck with that. But for training something like an LLM, you might need the whole dataset, which in many cases has its own problems with copyright, etc.

[1] https://opendata.cern.ch/docs/terms-of-use

[2] https://opendata.cern.ch/docs/lhcb-releases-entire-run1-dataset

[3] https://root.cern/
albert180 over 1 year ago
I think the biggest issue is with publishing the datasets. Then people and companies would discover that it's full of their copyrighted content and sue. I wouldn't be surprised if they slurped the whole of Z-Library et al. into their models, or if Google used its entire Google Books dataset.
anticorporate over 1 year ago
The Open Source Initiative, who maintain the Open Source Definition, have been running a whole series over the past year to collect input from all sorts of stakeholders about what it means for an AI to be open source. I was lucky enough to participate in an afternoon-long session with about a hundred other people last year at All Things Open.

https://deepdive.opensource.org/

I encourage you to go check out what's already being done here. I promise it's way more nuanced than anything that is going to fit in a tweet.
mgreg over 1 year ago
Applying the term "open source" to AI models is a bit more nuanced than to software. Many consider reproducibility the bar to get over to earn the label "open source."

For an AI model that means the model itself, the dataset, and the training recipe (e.g. process, hyperparameters), often also released as source code. With that (and a lot of compute) you can train the model to get the weights.
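To make "training recipe" concrete, here is a minimal sketch of the kind of manifest a reproducible release might publish alongside the weights; every name and number below is an illustrative assumption, not a description of any particular model.

    # Hypothetical training-recipe manifest for a reproducible release.
    # Real recipes would also pin library versions, preprocessing code,
    # and hardware/cluster configuration.
    recipe = {
        "model": {"architecture": "decoder-only transformer",
                  "n_layers": 24, "d_model": 2048, "n_heads": 16},
        "data": {"datasets": ["example-web-corpus-v1"],   # placeholder dataset name
                 "tokenizer": "example-bpe-50k",          # placeholder tokenizer
                 "shuffle_seed": 1234},
        "training": {"optimizer": "AdamW",
                     "learning_rate": 3e-4,
                     "lr_schedule": "cosine",
                     "global_batch_size_tokens": 2_000_000,
                     "total_tokens": 1_000_000_000_000,
                     "seed": 42},
    }

    # With the data, this recipe, and enough compute, a third party can rerun
    # training and approximate (or, with deterministic kernels, match) the weights.
    for section, params in recipe.items():
        print(section, params)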
darrenBaldwin03 over 1 year ago
Same with open-core: if you can't self-host the thing on your own infra then it's not REALLY OSS.
tqi over 1 year ago
"the project does not benefit from the OSS feedback loop" It's not like you can submit PRs to training data that fix specific issues the way you can submit bug fixes, so I'm skeptical you would see much of a feedback loop.

"it's hard to verify that the model has no backdoors (eg sleeper agents)" Again, given the size of the datasets and the opaque way training works, I am skeptical that anyone would be able to tell if there is a backdoor in the training data.

"impossible to verify the data and content filter and whether they match your company policy" I don't totally know what this means. For one, you can/probably should apply company policies to the model outputs, which you can do without access to training data. Is the idea that every company could/should filter input data and train their own models?

"you are dependent on the company to refresh the model" At the current cost, this is probably already true for most people.

"A true open-source LLM project — where everything is open from the codebase to the data pipeline — could unlock a lot of value, creativity, and improve security." I am overall skeptical that this is true in the case of LLMs. If anything, I think this creates a larger surface for bad actors to attack.
andy99 over 1 year ago
I don't agree, and the analogy is poor. One can do the things he lists with a trained model. Having the data is basically a red herring. I wish this got more attention. Open/free software is about exercising freedoms, and they all can be exercised if you've got the model weights and code.

https://www.marble.onl/posts/considerations_for_copyrighting_AI.html
tbrownaw over 1 year ago
> The “source code” for a work means the preferred form of the work for making modifications to it.

-- GPLv3

These AI/ML models are interesting in that the weights are derived from something else (the training set), but if you're modifying them you don't need that. Lots of "how to do fine-tuning" tutorials floating around, and they don't need access to the original training set.
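As a rough illustration of that point, here is a minimal LoRA fine-tuning sketch that needs only the released weights, never the original training set. It assumes the Hugging Face transformers and peft libraries; the model id and module names are placeholders, not a reference to any specific release.

    # Fine-tuning needs the published weights plus *your own* data; the original
    # pre-training corpus never enters the picture.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "some-org/some-open-weights-model"  # placeholder model id
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Attach small trainable adapter matrices; the base weights stay frozen.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],  # names depend on the architecture
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()

    # From here, feed your own dataset into a normal training loop
    # (e.g. transformers.Trainer) and save only the adapter weights.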
cpeterso over 1 year ago
Are there any true open-source LLM models, where all the training data is publicly available (with a compatible license) and the training software can reproduce bit-identical models?

Is training nondeterministic? I know LLM outputs are purposely nondeterministic.
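For what it's worth: training is nondeterministic by default (GPU kernel scheduling, data-loader ordering, distributed reductions), though frameworks expose switches that pin most of it down, while output nondeterminism is usually just deliberate sampling. A hedged PyTorch sketch of the usual determinism knobs, assuming nothing beyond stock torch/numpy:

    # Common determinism switches for PyTorch training. Even with all of these,
    # bit-identical reruns generally also require identical hardware, drivers,
    # and library versions.
    import os, random
    import numpy as np
    import torch

    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some CUDA ops
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)
    torch.use_deterministic_algorithms(True)  # raise an error on nondeterministic kernels
    torch.backends.cudnn.benchmark = False    # disable run-dependent kernel autotuning

    # Inference-side randomness is mostly sampling; greedy decoding or a fixed
    # sampling seed makes generation repeatable.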
beardyw over 1 year ago
I think the answer is in the name. The "source" has always been what you need to build the thing. In this context I think we can agree that the thing is the model. Based on that, the model is no more open source than a binary program.
declaredapple over 1 year ago
I'll venture to say the majority of these "open access models" are meant to serve as advertisements of capabilities (either of hardware, research, or techniques) and nothing more, MPT being one of the most obvious examples.

Many don't offer any information; some do offer information but provide no new techniques and just threw a bunch of compute and some data at making a sub-par model that shows up on a specific leaderboard.

Everyone is trying to save a card up their sleeve so they can sell it. And showing up on scoreboards is a great advertisement.
pabs3 over 1 year ago
I like how Debian's machine learning policy says this:

https://salsa.debian.org/deeplearning-team/ml-policy/
nathanasmith over 1 year ago
Publish your data and prepare to get vilified by professional complainers because the data doesn't conform to their sensibilities. Lots of downside with very little of the opposite.
ramesh31 over 1 year ago
No, but it's still insanely useful and free as in beer.
belval over 1 year ago
> if you can’t reproduce the model then it’s not truly open-source.

Open-source means open source; it does not make reproducibility guarantees. You get the code and you can use the code. Pushed to the extreme, this is like saying Chromium is not open-source because my 4GB laptop can't compile it.

Getting the training code for GPT-4 under MIT would be mostly useless, but it would still be open source.
emadm over 1 year ago
We made our last language model fully reproducible, including all datasets, training details, hyperparameters, etc.: https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo
Der_Einzige over 1 year ago
95% of the value comes from the model being freely downloadable and analyzable (i.e. not obfuscated/crippled post hoc). Sure, there is some difference, but as a researcher I care far more about open access than about making every "gnuight" on the internet happy that we used the right terminology.
edoardo-schnell over 1 year ago
So, we need something like dockerfiles for models?
fragmede over 1 year ago
it's "model available", not open source!
robblbobbl over 1 year ago
Agreed.
RcouF1uZ4gsC over 1 year ago
I would argue that while technically correct, it is not what most people really care about. What they care about are the following:

1. Can I download it?

2. Can I run it on my hardware?

3. Can I modify it?

4. Can I share my modifications with others?

If the answers to those questions are in the affirmative, then I think most people consider it open enough, and it is a huge step for freedom compared to models such as OpenAI's.