AI weights are not open “source”

238 points by subomi almost 2 years ago

35 comments

ndriscoll almost 2 years ago
The complexity described seems to be resting on the unestablished idea that weights are copyrightable in the first place. If they're not, then presumably "available weights", "ethical weights", and "open weights" are all the same: open weights. Either your weights are under NDA and presumably considered to be a trade secret, or they are public, and the words in your "license" mean absolutely nothing? That seems like a rather important point to bring up when discussing the licensing landscape for weights...
tiffanyg almost 2 years ago
> AI licensing is extremely complex. Unlike software licensing, AI isn't as simple as applying current proprietary/open source software licenses. AI has multiple components—the source code, weights, data, etc.—that are licensed differently.

Are you joking? This isn't *wrong*, per se, but it's worded as though written by someone with only the most casual / cursory interaction and knowledge of this area of law / commerce (e.g., including licensing, copyright, trademark / service mark, patent, etc.) ... until perhaps quite recently.

Yes, the AREA IS complicated. No, so-called "AI" is not introducing all sorts of novel issues, structures, etc. "AI" has some nuances distinct from much of what has come before (happens basically every time more significant tech comes along) and some possibly more unique questions related to economics, ethics, philosophy, and the like, but the relevant areas of law and practice have often been complicated and sort of "bleeding edge", even going back before the industrial revolution.

Big money, powerful tech, large-scale economic forces, etc. = lots of maneuvering, legislation, litigation, etc. = complicated "rules of the game".

Drawing the distinction vs. software in general is reasonable - but the rather click-baity headline and "I just learned about 'IP' law and bah gawd y'all are doin' it wrong" tone at the start of this article suggest, to me, that this isn't likely to be the best article to use as a reference for learning about these issues.

jkeisling almost 2 years ago
The article makes a good point: we should prevent “open-washing” and draw a distinction between well-intentioned restrictive licenses like “Open”RAIL and true open source. However, I worry the name “ethical source” is itself a bit question-begging. While outfits like Bloom may believe in good-faith ethical principles, their definition of ethics isn’t necessarily everyone’s. If restricted models are “ethical”, is releasing open weights “unethical”? Conversely, is releasing a model with PII or artist styles in it “ethical” if a few known use cases are forbidden? There’s no one right answer. Labeling any one set of restrictions as “ethical” off the bat makes discussion harder and puts open source on the back foot to justify “not being ethical”. Better to just call them “restricted models” or “guarded models”, and leave it to individuals to decide if these restrictions are beneficial or not.
mellosouls almost 2 years ago
Hmm. Makes a few unsubstantiated claims, with hand-wavy appeals to risks that our private corp overlords are presumably protecting us humble users from, now that they've built their product on open source and open data and are closing it down and changing terminology to suit.

There's an intelligent discussion to be had, and I think this otherwise-reasonable article could be part of it if it toned down the presumption and condescension a little.

zarzavat almost 2 years ago
Weights might be copyrightable, but in no universe are they copyrightable by OpenAI, Google, etc. just because they did the training and spent money on GPUs.

The only people who could possibly own the copyright, if any such copyright exists, are the authors of the training data.

I find this whole discussion about copyright of weights almost absurd. The incredible amount of deference given to our corporate lords is such that we are "hallucinating" new forms of IP protection for NN weights that have never existed in any kind of statute or case law and cut completely against the grain of all the law that currently exists.

bee_rider almost 2 years ago
> The ethical license category applies to licenses that allow commercial use of the component but includes field of endeavor and/or behavioral use restrictions set by the licensor.

I don't love the name: "ethical license" sounds like a description of the license, i.e. this license is ethical. Really this sort of license imposes a particular ethical framework on the user.

Not to throw shade, though. It is actually hard to come up with a neutral-sounding name for this sort of license, I think. I keep thinking of things like "morality-encumbered license," but that sounds ridiculously euphemistic in a weird way.

habitue almost 2 years ago
One thing I don't see discussed enough: ok, let's say the weights are unencumbered and the source is under an OSI license. The point of open source licenses and free software was to expose the *human understandable* meaning of the final program.

That's why distributing only binaries isn't allowed, even though technically all of the functionality is present in the machine code. AI weights are basically binary blobs. We don't know what they mean, and there is really no source code for them. The best we can do is various black-box manipulations on them like LoRA, etc., similar to what we can do to a binary blob.

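To make "black-box manipulation" concrete, here is a minimal LoRA-style sketch (PyTorch assumed; the layer, shapes, and rank are made up for illustration and not taken from any particular model): the released weights stay frozen and opaque, and only a small low-rank adapter trained around them changes.

    # Minimal sketch: adapt a frozen, opaque weight blob from the outside with a
    # LoRA-style low-rank update, without ever interpreting what the weights mean.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # the "binary blob" is never edited directly
            # Trainable low-rank factors; the effective weight becomes W + B @ A.
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))

        def forward(self, x):
            return self.base(x) + x @ self.A.T @ self.B.T

    base = nn.Linear(768, 768)          # stands in for a layer loaded from released weights
    adapted = LoRALinear(base, rank=8)  # only A and B receive gradients during fine-tuning
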
dahart almost 2 years ago
> Some people have the perspective that if a license isn’t open source, it’s proprietary. I think it’s more nuanced than that and believe there are three more license types worth naming: non-commercial NDA, non-commercial public, and ethical.

It’s very useful to remember the U.S. government definition of commercial software: it is software that “Has been sold, leased, *or licensed* to the general public” [1]

This means that a “non-commercial license” is a bit of an oxymoron to a lot of people. Their definition of commercial includes all software with a license, and does not depend on whether the software costs money. (Perhaps not entirely unlike how the FSF does not define “free software” based on whether it costs money.)

[1] https://www.acquisition.gov/far/2.101

cpcallen almost 2 years ago
I'm disappointed that the article is only making the (somewhat pedantic) distinction between source code and weights. From the quotation marks in the headline, I hoped that it would instead be making the distinction between human-readable source code and machine-readable compiled form.

For example, IMHO (IANAL) an AI code-completion tool that had been trained on GPL software is (or should be) only legal to distribute if it is accompanied by the training code _and all the code ingested during training_ (or an offer to provide such code upon request).

meindnoch almost 2 years ago
According to whom?

Weights are a type of program, interpreted by the neural network runtime. Same as Java bytecode interpreted by the JVM runtime.

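As a toy illustration of that analogy (numpy assumed; the file name and array names below are hypothetical), the weight file is inert data, and a few lines of runtime code play the role of the interpreter that gives it behavior:

    # The .npz file holds nothing but arrays of numbers; behavior only appears
    # when a runtime applies a fixed sequence of ops to them, much as the JVM
    # gives meaning to bytecode.
    import numpy as np

    weights = np.load("model.npz")  # hypothetical file containing W1, b1, W2, b2

    def run(x):
        h = np.maximum(0, x @ weights["W1"] + weights["b1"])  # "instruction" 1: linear + ReLU
        return h @ weights["W2"] + weights["b2"]              # "instruction" 2: linear readout
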
TheRealPomax almost 2 years ago
So, Open Data. Got it. This is the same category as config files that are kept up to date by a program as it runs.

- Is it "a program"? Very clearly not.

- Is it source code? You can argue either way. The program won't work without it, but "this specific one" is not required for the program to do *something*, and that ambiguity means you probably don't want to call it "source code" because it's too vague.

- Is it *data* used by a program in order to perform its task? Absolutely. It even uniquely defines the program's behaviour, and so is a thing unto itself within the context of the program it's used by.

TZubiri almost 2 years ago
Agreed, output weights are target code, and no one would argue the contrary. Companies pretending to publish source code is nothing new.

Stallman defines source code as "the preferred way in which developers modify the program".

I wrote for Wikipedia once that

"Stallman's definition thus contemplates JavaScript and HTML's source-target ambivalence, as well as contemplating possible future forms of software production, like visual programming languages, or datasets in Machine Learning."

So the datasets could be a form of source code, but the most appropriate source code would be the code that crawls or downloads the dataset and modifies it.

Clear as water.

kmeisthax almost 2 years ago
> While the RAIL organization suggests adding the word "Open" to RAIL licenses that include similar open-access and free-use as open source (i.e. OpenRAIL-M), this is confusing since the license is not open source so long as it includes usage restrictions. A better name would be EthicalRAIL-M. Using the term "ethical" to describe this category license clearly indicates its functional difference from open source licenses.

I don't even think we should be using the word "ethical", because it implies that anything more permissive is *un*ethical. We should call these morality clause licenses.

The question of whether or not we *should* have morality clauses involved is complicated. Most bad actors do not give a shit about the licensing status of the code they are using. And these licenses also cause headaches for people who want to follow the rules[0] and avoid copyleft trolling[1]. On the other hand, the morality clauses in OpenRAIL-M are relatively straightforward and non-obnoxious.

[0] This also applies to "non-commercial" licensing, since that is a concept entirely foreign to copyright law. As far as I'm concerned the 'NC' clause in Creative Commons just means 'OK to torrent'.

[1] A practice in which people abuse copyleft licenses to try to extract licensing agreements for minor license violations. The forgiveness periods added to GPLv3 and later versions of Creative Commons are specifically intended to prevent this behavior.

horsawlarway almost 2 years ago
If anything, this entire conversation just highlights (over and over and over and over again) how absolutely bonkers and abusive our current copyright laws are.

The vast majority of small individuals are compelled by contract to surrender their rights to large corporations. Those large corporations then abuse the ever-loving fuck out of those rights.

The express intent of copyright is now a sad joke.

Personally, I'm pretty over the entire show. This system is generating an incredible amount of inequality. New and novel content is absolutely NOT getting made, and these laws are creating vicious infights that drain resources from well-intentioned companies & individuals and pass them along to complete scam corporations.

We are told stories as children that we cannot retell in our own voices decades later to our own children.

I am firmly ready to burn this copyright system to the fucking ground. It's been 300 years since the Statute of Anne - I'm ready for a different game.

morpheuskafka almost 2 years ago
> AI also poses socio-ethical consequences that don’t exist on the same scale as computer software, necessitating more restrictions like behavioral use restrictions

There's plenty of software that has, or could have, similar restrictions. Consider software that allows you to plan vantage points for a shooting, or to estimate the impact of using explosives at various locations. And the government regulates all sorts of software for export/download because it has military use--everything from development tools, to high-performance chips that could be used to crunch numbers for a nuclear program, to CAD software that can help you build (or destroy) a bridge. The CPUs and GPUs themselves are regulated at certain performance levels, I think.

None of this is really new to AI.

robomartin almost 2 years ago
Can someone give me a legal answer to this?

People, from early school all the way up to university, use copyrighted materials to learn various topics and obtain degrees. This trains our brains using the work of others.

The same is true as we navigate life. We learn various skills and subjects by consuming the work of others.

And, yes, in the case of most people, we use that training to pursue various careers, obtain work, and get paid for it.

How can there be a claim of infringement on the part of LLMs and not on every person who has ever used a book, website, article, video, or publication to learn something?

Zetobal almost 2 years ago
If my own data is in the dataset even though I didn't give consent, is it a collaborator dataset?

c7b almost 2 years ago
Imho the weights are the real meat for most typical models: you can run with them and continue training them with your own code. It's not even guaranteed that the original code would be very useful for that.

But if you are going to make that distinction, for which you can make a case I think, shouldn't you include a third dimension, 'data'? The code alone is hardly useful if you want to rebuild the weights; all it tells you is that they're loading their proprietary data and then using PyTorch to set up and train the model. You can't reproduce anything using just that. So the real equivalent of open source would imho be either open weights, or open data plus code plus weights (the latter are arguably redundant, but still practical to include). Given that the size of that repo will typically be gigantic, I think open weights is the case we should really be focusing on. I'd rather have a paper explaining the model together with the weights, rather than code that I can't run anyway, if I'm designing an algorithm to continue training the model.

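A minimal sketch of that "weights are the real meat" point (PyTorch assumed; the model class, file name, and data loader are placeholders, not from any specific release): given only the released weights and a compatible re-implementation of the architecture, training can be resumed with entirely your own code.

    import torch

    model = MyModel()  # placeholder: your own re-implementation of the architecture
    model.load_state_dict(torch.load("released_weights.pt"))  # the released artifact
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for batch, targets in my_dataloader:  # placeholder: your data, your objective
        loss = torch.nn.functional.cross_entropy(model(batch), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
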
soultrees almost 2 years ago
Just a thought: what would happen if all copyright were abolished, or if the generative AI revolution we've been seeing continues to the point where almost everything is machine-generated and therefore open season (if derivative copyright isn't protected)? What would actually happen to the US economy?

I don't buy the argument that people just won't innovate anymore because there won't be an incentive. There are multiple motivations that exist simultaneously; for example, governments have a motivation to stay technologically advanced compared to peer nations, and humans have an inherent desire to create, for power, notoriety, etc.

So in that case, let's just abolish the first layer of incentive, which through absurd copyright laws actually uncovers more greed than anything, open the flood gates, and get rid of all copyright. We need some real innovation, and all this babble about who owns an 'idea' is way too restricting.

worksonmine almost 2 years ago
> Unlike software licensing, AI isn’t as simple as applying current proprietary/open source software licenses. AI has multiple components—the source code, weights, data, etc.—that are licensed differently.

Software also has multiple components, often the same as the ones listed by the author. But what do I know; to me AI is just another example of software.

TrackerFF almost 2 years ago
Weights are just matrices with values within a certain range. So are digital images - just matrices with values. Images are covered by copyright laws, so why shouldn't weights also be?

FrustratedMonky almost 2 years ago
Are the weights in our brain copyrightable?

Might want to get ahead of the curve on this one. How would this work? Would I get a tattoo with a license spelling out coverage of the contents of my body?

light_hue_1 almost 2 years ago
I think this is very shortsighted.

Weights are a program. CUDA is an interpreter for that program.

One day we will be able to decompile these programs into something more human-understandable.

low_tech_punk almost 2 years ago
The lack of freedom to modify makes it not "open" either.

Compared to traditional software, weights are actually worse than a binary. You can't "decompile" the weights into the training source code, so there is no way for the community to make useful changes to them.

barbariangrunge almost 2 years ago
Completely off topic, but funny: I misread "opencoreventures" as "opencorevultures".

amelius almost 2 years ago
Just like you can't decompile a binary without loss of information, "source" means that you can reconstruct the artifact. So the training data should be available, as well as the code that was used to train it and the build script that invoked it.

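A sketch of what "source" for weights would actually entail (the module, paths, and functions below are all hypothetical placeholders for things a vendor would have to publish): a build script that can regenerate the artifact from the data and the training code.

    # Hypothetical build script: everything imported or referenced here would
    # need to be released for the weights to count as reconstructible "source".
    from my_training_code import build_dataset, build_model, train  # hypothetical published module

    dataset = build_dataset("training_corpus/")  # the data itself must be available
    model = build_model("model_config.yaml")     # the architecture definition
    weights = train(model, dataset, seed=0)      # ideally deterministic, so the build reproduces
    weights.save("reconstructed_weights.pt")
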
cf141q5325 almost 2 years ago
A focus on licensing ignores that there are security incentives not to run just any weights you find floating around the net. Getting exploited through misaligned networks is a very real threat and really hard to combat.

adamsmith143 almost 2 years ago
The question shouldn't be whether the weights are copyrightable but whether they are protected under other electronic communication/data privacy laws.

ronsor almost 2 years ago
Model weights are not source code, but data. Arguably because of how they are generated, they are not even copyrightable at all.
mensetmanusman almost 2 years ago
Weights are an information asset that requires millions in capital and burned-out GPUs to mine and refine.

Makhini almost 2 years ago
What if you change the weights slightly? Kaboom, not breaking the copyright anymore.
seydor almost 2 years ago
If it is extremely complex, then it can only be modeled by an AI
Topfi almost 2 years ago
This post did cover many of the same ideas I have been ruminating on concerning model weights and the nomenclature of current efforts. That's also why I generally tend to stick with calling these[0] "local/self-hosted models" for the time being. A major reason for my reluctance is that I see weights as far closer to a binary than to code, making a distinction important and current FOSS concepts not really applicable.

Of course, this all hinges on the idea that weights by themselves are inherently protected by current copyright, which still seems to be an unsettled topic, hotly debated by both laypeople and legal professionals. Authors generally are afforded copyright on their work by default, and weights raise so many questions concerning authorship that have never been considered.

This being such a contested issue, which will require new laws and/or precedent (depending on the legal system), is very problematic. Regardless of where you live, courts and government entities are generally not famous for their speedy reaction to new things, so clarity may take a while, at which point the industry might have already settled on some agreement that then may be adopted as a basis for actual legislation, which would likely favor financially well-backed entities already actively lobbying for their interests, such as OpenAI.

Some have also pointed out that this is arguing semantics, and I am tempted to agree in principle, but I also want to emphasize that I feel this is a situation where that can be valuable. Should weights in some way be afforded copyright protection, clear nomenclature will be needed. Putting some thought into this now is definitely not the worst idea.

I very strongly feel that the specific word "ethical" as part of defining licenses is not the best idea, though. "Ethical" can carry vastly different connotations, depending on a myriad of factors, many of which go beyond the use-focused definition laid out in the post. Due to this, I'd argue for "behavioral" or "restricted use" over "ethical", as both more clearly state what the intended effect is in cases such as Open RAIL-M[1].

Part of my strong feelings on the use of the word "ethical" comes from the fact that with weights and training data, there has been a lot of discussion concerning both the rights of and considerations for creators whose published works have been used to create those weights. Due to this, using "ethical" to refer to a group of licenses could give some the impression that the training data was "ethically sourced", i.e. in agreement with the original creators. This is something that in my eyes should also have clear labeling, though with weights being very hard to reliably trace back to source data, it currently seems impossible to verify, making this essentially just a good-faith effort.

[0] https://huggingface.co/tiiuae/falcon-40b-instruct

[1] https://drive.google.com/file/d/16NqKiAkzyZ55NClubCIFup8pT2jnyVIo/view

thepangolino almost 2 years ago
I've always seen weights as akin to configuration files.

ianbutler almost 2 years ago
I'm not sure OCV gets to decide any of this, just like I don't think OSI trying to be the sole dictator of the term "Open Source" works out long term. My opinion on things like this is always received controversially, but terms evolve to meet common usage. If people are calling this "Open Source", and there are more people who want to call it "Open Source" than people who don't, then unless you intend to legally bar them from using the term, with actual action like a lawsuit, eventually this will also be encompassed by the term "Open Source" as people know it, like it or not.

Yes, I know this term is currently defined explicitly by OSI; no, I don't think language prescriptivism wins out regardless of how hard they try. And since I haven't seen any of the hundreds of quasi-open-source, but not really, companies get dragged to court over usage of the term, this is all toothless complaining in my view.

As to their actual point, I might actually agree with them if it were only the weights being shared. In most cases the configuration is also shared, which allows popular frameworks to instantiate the model and then execute it for either inference or further training, making the release fully suitable for modification and re-release. I don't need the exact implementation of FlashAttention they used if I can load the model into Huggingface and use theirs, or mine, or whatever (see the sketch below).

Edit: This obviously doesn't apply to the models that have restrictions placed on usage, just in case people think I mean every instance of sharing a model. Those are obviously restricted use, and I agree it muddies the term.

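For reference, a minimal sketch of that last point (Hugging Face transformers assumed; the repo id is a placeholder, not a specific release): with a published config and weights, a generic framework can instantiate and run the model without the vendor's own training or attention code.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder repo id; any model published as config + weights works the same way.
    tokenizer = AutoTokenizer.from_pretrained("some-org/some-open-weights-model")
    model = AutoModelForCausalLM.from_pretrained("some-org/some-open-weights-model")

    inputs = tokenizer("Weights plus config are enough to", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))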