To data scientists, machine learning engineers, and data engineers -- how do you manage your datasets? What tools and workflows, if any, do you use to version your data alongside your code?

Currently, my workflow for data analyses / modelling is essentially:

1. Write a SQL query for the desired dataset

2. Run the query to produce a CSV

3. Hash the file to use as an identifier

4. Upload the file to S3 (steps 3-4 are sketched in code at the end of this post)

5. Reference the file in Jupyter notebooks / scripts etc.

6. Return to step 1 or 2 (depending on whether I'm updating a report or creating a new experiment with new data)

I'm curious whether people have experience using tools such as DVC [0] for managing experiments; I've also sketched below what I think the DVC equivalent of my notebook step would look like. Git LFS could be useful, but it seems aimed more at binary assets than at large datasets of many GBs.

[0] https://dvc.org/
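For concreteness, here is roughly what steps 3-4 look like in my scripts. This is a simplified sketch: the bucket name, key prefix, and file names are placeholders, not my real setup.

    import hashlib
    import boto3

    def content_hash(path: str) -> str:
        """Stream the file through SHA-256 so large CSVs needn't fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def upload_versioned(path: str, bucket: str = "my-datasets") -> str:
        """Upload the CSV under its content hash; identical data maps to the same key."""
        key = f"datasets/{content_hash(path)}.csv"
        boto3.client("s3").upload_file(path, bucket, key)
        return f"s3://{bucket}/{key}"

    # uri = upload_versioned("query_results.csv")
    # The returned URI is what gets pinned in notebooks / scripts.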
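And, if I've read the DVC docs right, the notebook side of the equivalent workflow might look something like this once a CSV has been tracked with `dvc add` and pushed to an S3 remote. The file path and git tag here are hypothetical:

    import dvc.api
    import pandas as pd

    # Open a DVC-tracked CSV as it existed at a given git revision,
    # so the data version is pinned by the same commit/tag as the code.
    with dvc.api.open(
        "data/dataset.csv",      # hypothetical path tracked by `dvc add`
        repo=".",                # the analysis repo itself
        rev="report-2019-q3",    # hypothetical git tag
    ) as f:
        df = pd.read_csv(f)

The appeal, as I understand it, is that the hash-and-upload bookkeeping from my current steps 3-4 is handled by DVC, and checking out a git tag recovers both the code and the matching data.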