Projects like this are inevitable and necessary; 'OpenAI' makes such a mockery of its name that it's an open invitation for others to build an alternative that is actually open.
There is a recent Yannic Kilcher interview about LAION:

> LAION-5B: 5 billion image-text-pairs dataset (with the authors)

https://www.youtube.com/watch?v=AIOE1l1W0Tw

A nice recent DeepMind result is that you can reach the same loss by making either the dataset or the network roughly 4x larger. So a large dataset can buy you a smaller, more efficient model, which in turn is easier to distribute and use.

https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training
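To make that tradeoff concrete, here is a minimal sketch of the parametric loss fit from that paper, L(N, D) = E + A/N^alpha + B/D^beta. The constants are the fitted values reported by Hoffmann et al.; the specific parameter/token counts below are made-up illustrative numbers, not anything from the paper:

    # Chinchilla-style parametric loss: L(N, D) = E + A/N**alpha + B/D**beta
    # Constants are the fits reported in Hoffmann et al. 2022; treat the
    # exact numbers as illustrative, not gospel.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params: float, n_tokens: float) -> float:
        return E + A / n_params**alpha + B / n_tokens**beta

    base = loss(1e9, 20e9)           # 1B params, 20B tokens (hypothetical baseline)
    more_data = loss(1e9, 80e9)      # same model, 4x data
    more_params = loss(4e9, 20e9)    # 4x model, same data

    print(f"baseline:  {base:.3f}")       # ~2.58
    print(f"4x data:   {more_data:.3f}")  # ~2.41
    print(f"4x params: {more_params:.3f}")# ~2.45

Under the fitted exponents the two routes give comparable (not identical) loss drops, which is the rough data-vs-parameters interchangeability being pointed at here.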
Their marketing is so bad. The website is terrible, they present themselves first by opposing OpenAI, and they name their datasets the way established orgs name their models. Their only project is an uncurated filtering of already open-source data using CLIP: they looped over the pairs and dropped any image-text pair whose cosine similarity fell below 0.3.
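For what it's worth, that filtering step is easy to reproduce in miniature. A minimal sketch using the Hugging Face transformers CLIP wrappers, assuming a ViT-B/32 checkpoint and the 0.3 threshold mentioned above; the function name and everything else here is purely illustrative, not LAION's actual pipeline:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
        """Keep an image-text pair only if CLIP cosine similarity >= threshold."""
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        # Normalize, then cosine similarity is just a dot product.
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img * txt).sum(-1).item() >= threshold

At LAION's scale this was of course run as a distributed batch job over billions of crawled pairs, but the core keep/drop decision is about this simple.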