To me, the ML situation looks roughly like this.

(1) Model weights are something like a bytecode blob. You can run it in a conformant interpreter and do inference.

(2) Things like llama.cpp are the "bytecode interpreter" part: something that can load the weights and run inference.

(3) The training setup is like a custom "compiler" that turns training data into the "bytecode" of the model weights.

(4) The actual training data is like the "source code" of the model, the input to the training "compiler".

Currently (2) is well served by a number of open-source offerings. (1) is what usually gets published when a new model is released. (1) + (2) together give you the ability to run inference independently (rough sketches of the "interpreter" and "compiler" halves are at the end of this comment).

AFAICT, Red Hat suggests that an "open-source ML model" must include (1), (2), and (3), so that the way the model has been trained is also open and reusable. I'd say that's great for scientific and applied progress, but I don't think it's "open source" proper. You get a binary blob plus a compiler that can produce and patch such blobs, but you can't reproduce the blob the way the authors did.

Releasing the training set, the (4), would to my mind be crucial for the model to actually be "open source" in the way an open-source C program is.

I understand that the training set is massive, may contain a lot of data that was licensed for training purposes but can't easily be released publicly, and that training from scratch may cost millions, so releasing the (4) is very often infeasible.

I still think that (1) + (2) + (3) should not be called "open source", because the source is not open. We need a different term, like "open structure" or something. It's definitely more open than something that's only available via an API, or as bare weights, but it's not completely open.
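To make the (1) + (2) half concrete, here's a rough sketch of "bytecode + interpreter" inference through the llama-cpp-python bindings. The model path, prompt, and generation parameters are placeholders, and the exact API surface may vary between versions, so treat it as an illustration of the shape of the thing rather than a recipe:

    # Weights file ("bytecode") + llama.cpp ("interpreter") = local inference.
    # Assumes `pip install llama-cpp-python` and a GGUF weights file on disk;
    # the path and parameters below are placeholders, not a real release.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/some-7b-model.gguf")  # load the opaque weight blob

    out = llm(
        "Q: What does 'open source' mean for an ML model? A:",
        max_tokens=64,   # keep the completion short
        stop=["Q:"],     # don't let it start a new question
    )
    print(out["choices"][0]["text"])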
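And for the (3) + (4) half, a toy sketch of the "compiler": a training loop that turns training data (the "source") into a saved weights file (the "bytecode"). This is a made-up PyTorch example with random stand-in data, nothing like an actual LLM training setup; it only shows the data flow.

    # The training setup as a "compiler": training data in, weight blob out.
    # Toy PyTorch example with random stand-in data; real training recipes
    # differ in every detail, this only shows the data flow.
    import torch
    import torch.nn as nn

    train_x = torch.randn(256, 16)   # stand-in for (4), the "source code"
    train_y = torch.randn(256, 1)

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(100):             # the "compilation" pass
        opt.zero_grad()
        loss = loss_fn(model(train_x), train_y)
        loss.backward()
        opt.step()

    # The emitted "bytecode": an opaque blob you can load and run, but
    # without train_x/train_y you can't regenerate or fully audit it.
    torch.save(model.state_dict(), "model.pt")

Shipping model.pt plus a loop like the above is the (1) + (2) + (3) bundle; shipping train_x/train_y as well would be the (4) that I'd want before calling it "open source".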