I am currently in charge of deciding on the tech stack for a large-scale AI project in the computer vision space.

Most things are settled, but we expect to collect a LOT of data that will be labeled and/or auto-labeled (to the tune of 100 million video clips).

We will be training multiple models for different tasks from that data, and we need a good system to organize it.

Does anybody have any tips or experiences with that kind of thing?
We can use any on-premise or cloud solution.

Specifically, we would need:

* Data ingestion pipeline (data will come from field personnel)
* Data versioning
* Being able to define datasets that are a subset of the whole collected data
* Inexpensive storage (e.g. S3 or similar)
* Branching/Merging for maintaining production training data sets
* Metadata storage and query capabilities ...
* User interface for less tech-savvy people (e.g. a git-like command line is fine for engineers, but not for field personnel who are not in IT)

I know of tools like https://dvc.org/, but a) they are just layers on top of git, b) they break apart on huge datasets without a folder hierarchy (git tree objects just don't work for linear lists of items), c) they are only usable by IT personnel, and d) they require checking out at least a part of the dataset.

Our datasets would be 100,000,000 x 100 MB = 10 PB of raw data. Training data should be delivered to training nodes via the network; we just can't have a full checkout of that data.
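For context on the "no full checkout" requirement, here is a minimal sketch of the manifest-plus-streaming pattern we have in mind, assuming boto3, PyTorch, and a JSONL manifest of S3 keys; the class, key layout, and manifest format are illustrative, not any particular tool's API:

```python
import json

import boto3
from torch.utils.data import IterableDataset


class ManifestClipDataset(IterableDataset):
    """Streams video clips listed in a JSONL manifest of S3 keys."""

    def __init__(self, manifest_path: str, bucket: str):
        self.bucket = bucket
        with open(manifest_path) as f:
            # each line: {"key": "clips/abc123.mp4", "label": "person"}
            self.records = [json.loads(line) for line in f]

    def __iter__(self):
        s3 = boto3.client("s3")  # one client per worker process
        for rec in self.records:
            obj = s3.get_object(Bucket=self.bucket, Key=rec["key"])
            clip_bytes = obj["Body"].read()  # decode/augment downstream
            yield clip_bytes, rec.get("label")
```

The idea is that only the small manifest is ever "checked out"; the 10 PB of clips stay in object storage and are pulled over the network per training job.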
I am working on a solution on top of Git that stores diffs only; it can integrate with MySQL and S3 and can create version snapshots. You would have the videos in one bucket, and the version-controlled history with links to the videos in another bucket, also on S3. Data can be added as a diff commit and later merged into production datasets. You would own the history via a readable Git repo plus pointers into your versioning bucket on S3, on top of any snapshots (a rough sketch follows the questions below). No UI yet, though. Many open questions, some of them:
- Data ingestion: how often per person, how many new videos each time, how many field personnel?
- Dataset carve-outs: how often, and on what exactly would you filter?
- Metadata: which fields per video, how often would you query, and against a specific version of the datasets? A few example queries would help to figure out where the metadata should live.
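For illustration, here is a rough sketch of how a diff commit of new clips and an append-only merge into a production manifest could look; the JSONL layout, file names, and fields are assumptions, not the actual tool's format:

```python
import hashlib
import json


def write_diff_commit(new_clips, diff_path):
    """new_clips: iterable of dicts like {"s3_uri": "s3://...", "labels": {...}}."""
    with open(diff_path, "w") as f:
        for clip in new_clips:
            # content-address each pointer so merges can de-duplicate
            clip["id"] = hashlib.sha256(clip["s3_uri"].encode()).hexdigest()[:16]
            f.write(json.dumps(clip) + "\n")


def merge_into_production(diff_path, prod_path):
    """Append-only merge: add diff records whose id is not already in prod."""
    with open(prod_path) as f:
        seen = {json.loads(line)["id"] for line in f}
    with open(diff_path) as f_diff, open(prod_path, "a") as f_prod:
        for line in f_diff:
            if json.loads(line)["id"] not in seen:
                f_prod.write(line)
```

Both the diff files and the production manifest would live in the Git repo, so the history stays small and readable while the videos themselves stay in S3.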
My email is in my profile; feel free to reach out, though most likely my solution is too early-stage for your needs.
We have been working on a data version control tool called Oxen that is tackling many of your needs. Feel free to check it out here:

https://github.com/Oxen-AI/oxen-release#-oxen

Going down your list of requirements, Oxen has:

* Data versioning: a similar paradigm to git, but built from the ground up for large ML datasets

* Inexpensive storage: pricing comparable to S3

* Branching/merging for maintaining production training datasets

* Metadata storage and query capabilities: works with many structured data types, and has APIs for querying

* User interface for less tech-savvy people: we are building out a hub at https://www.oxen.ai to enable this

* Being able to define datasets that are a subset of the whole collected data (is this a similar requirement to querying?)

* Data ingestion pipeline: engineers would have to hook into the APIs or CLI tools right now

Feel free to check it out and leave any feedback on the GitHub repo!
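On the "subset of the whole collected data" point, here is a tool-agnostic sketch of treating a dataset as a saved query over metadata; this is not Oxen's API, and the column names and file paths are made up:

```python
import pandas as pd

# one row per clip; columns are assumed for illustration
meta = pd.read_parquet("clip_metadata.parquet")

# a "dataset" is just the result of a filter, written out as a manifest
subset = meta[(meta["camera"] == "thermal") & (meta["label_status"] == "reviewed")]
subset[["s3_key", "label_path"]].to_json(
    "datasets/thermal_reviewed_v1.jsonl", orient="records", lines=True
)
```

If the manifest itself is versioned, "defining a subset" and "querying the metadata" become the same operation.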
If you are just looking for data versioning, there is Dolt:

https://github.com/dolthub/dolt

And that has a user-friendly UI in DoltHub:

https://www.dolthub.com/

You wouldn't store the video files themselves in Dolt; those would likely be links to S3, but all the labels and surrounding metadata could be stored in Dolt.

DISCLAIMER: I'm the CEO of DoltHub, so this is self-promotion.
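As a rough sketch of that split, assuming a locally running `dolt sql-server` and made-up database, table, and column names (Dolt speaks the MySQL wire protocol, so any standard MySQL client works):

```python
import pymysql

# connect to a local dolt sql-server; host/user/database are assumptions
conn = pymysql.connect(host="127.0.0.1", user="root", database="cv_labels")
with conn.cursor() as cur:
    # labels and metadata live in Dolt; the clips stay in S3 as links
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS clips (
            clip_id     VARCHAR(64) PRIMARY KEY,
            s3_uri      VARCHAR(1024) NOT NULL,
            label       VARCHAR(255),
            captured_at DATETIME,
            reviewer    VARCHAR(128)
        )
        """
    )
    cur.execute("SELECT s3_uri, label FROM clips WHERE label IS NOT NULL LIMIT 10")
    rows = cur.fetchall()
conn.commit()
```

The label tables then get Dolt's branching, merging, and diffing, while the bulk video bytes stay on cheap object storage.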