To data scientists, machine learning engineers, and data engineers -- how do you manage your datasets? What tools and workflows, if any, do you use to version your data alongside your code?

Currently, my workflow for data analyses / modelling is essentially:

1. Write a SQL query for the desired dataset

2. Run the query to produce a CSV

3. Hash the file to use as an identifier

4. Upload the file to S3 (steps 3-4 are sketched in code at the end of this post)

5. Reference the file in Jupyter notebooks / scripts etc.

6. Return to step 1 or 2 (depending on whether I'm updating a report or creating a new experiment with new data)

I'm curious whether people have experience using tools such as DVC [0] for managing experiments; I've also sketched below what I think the DVC equivalent of my notebook step would look like. Git LFS could be useful, but it seems aimed more at binary assets than at large datasets of many GBs.

[0] https://dvc.org/
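For concreteness, here is roughly what steps 3-4 look like in my scripts. This is a simplified sketch: the bucket name, key prefix, and file names are placeholders, not my real setup.

    import hashlib
    import boto3

    def content_hash(path: str) -> str:
        """Stream the file through SHA-256 so large CSVs needn't fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def upload_versioned(path: str, bucket: str = "my-datasets") -> str:
        """Upload the CSV under its content hash; identical data maps to the same key."""
        key = f"datasets/{content_hash(path)}.csv"
        boto3.client("s3").upload_file(path, bucket, key)
        return f"s3://{bucket}/{key}"

    # uri = upload_versioned("query_results.csv")
    # The returned URI is what gets pinned in notebooks / scripts.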
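And, if I've read the DVC docs right, the notebook side of the equivalent workflow might look something like this once a CSV has been tracked with `dvc add` and pushed to an S3 remote. The file path and git tag here are hypothetical:

    import dvc.api
    import pandas as pd

    # Open a DVC-tracked CSV as it existed at a given git revision,
    # so the data version is pinned by the same commit/tag as the code.
    with dvc.api.open(
        "data/dataset.csv",      # hypothetical path tracked by `dvc add`
        repo=".",                # the analysis repo itself
        rev="report-2019-q3",    # hypothetical git tag
    ) as f:
        df = pd.read_csv(f)

The appeal, as I understand it, is that the hash-and-upload bookkeeping from my current steps 3-4 is handled by DVC, and checking out a git tag recovers both the code and the matching data.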