Hey! I'm one of the creators of DataChain.<p>DataChain runs on your local machine and manages files in storage (like images and PDFs in S3 or GCP). Users can slice and dice their files using metadata. Examples:<p>- Download only the files labeled "Cats" instead of the whole dataset. Use JSON/Parquet files to get the labels.<p>- Use LLMs to generate metadata, e.g., "Are there more than 3 people in the image?"<p>- Add custom metadata to create a rich "DataFrame" of your files.<p>The DataFrame API is Python-based (Pydantic), but queries on Python objects are transpiled into database (SQLite) queries. Or you can just convert all the metadata into Pandas if you prefer.<p>WDYT? I’d love to hear your thoughts!
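<p>To make the "Cats" example concrete, here is a rough sketch of the idea using only the Python standard library. This is not DataChain's API: the paths, the label records, and the helpers `build_index` and `paths_with_label` are made up for illustration. It just shows file metadata sitting in SQLite and a query picking the files worth downloading.

```python
import json
import sqlite3

# Hypothetical label records, as they might come from a JSON annotations file.
LABELS_JSON = """
[
  {"path": "s3://bucket/img/001.jpg", "label": "Cats"},
  {"path": "s3://bucket/img/002.jpg", "label": "Dogs"},
  {"path": "s3://bucket/img/003.jpg", "label": "Cats"}
]
"""


def build_index(records):
    """Load file metadata into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, label TEXT)")
    conn.executemany(
        "INSERT INTO files (path, label) VALUES (:path, :label)", records
    )
    return conn


def paths_with_label(conn, label):
    """Return the paths whose label matches, so only those need downloading."""
    rows = conn.execute(
        "SELECT path FROM files WHERE label = ? ORDER BY path", (label,)
    )
    return [path for (path,) in rows]


conn = build_index(json.loads(LABELS_JSON))
cat_paths = paths_with_label(conn, "Cats")
print(cat_paths)
```

<p>Only the metadata is queried here; the files themselves stay in storage until you decide which paths to fetch.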