Ask HN: Dataset version control for ML / data science?

1 point by eadan, about 5 years ago
To data scientists, machine learning engineers, and data engineers: how do you manage your datasets? What tools and workflows, if any, do you use to version your data alongside your code?

Currently, my workflow for data analysis / modelling is essentially:

1. Write a SQL query for the desired dataset
2. Run the query to produce a CSV
3. Hash the file to use as an identifier
4. Upload the file to S3
5. Reference the file in Jupyter notebooks / scripts etc.
6. Return to step 1 or 2 (depending on whether I'm updating a report or creating a new experiment with new data)

I'm curious whether people have experience using tools such as DVC [0] for managing experiments. Git LFS could be useful, but it seems to be aimed more at binary assets than at large datasets of many GBs.

[0] https://dvc.org/
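
For concreteness, steps 3 and 4 of the workflow above might look roughly like the sketch below. It assumes SHA-256 as the hash (the post does not say which one is used), and the `datasets/` key prefix and both function names are hypothetical, chosen only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that multi-GB CSVs do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_addressed_key(path, prefix="datasets"):
    """Build an object key of the form <prefix>/<sha256><extension>.
    The layout is an assumption, not something the post specifies."""
    suffix = Path(path).suffix
    return f"{prefix}/{sha256_of_file(path)}{suffix}"
```

The resulting key could then be passed to an S3 client for the upload in step 4 (for example boto3's `upload_file(Filename, Bucket, Key)`), and recorded in the notebook or script so the exact dataset can be found again. Because the key is derived from the file contents, re-running an unchanged query yields the same key, while any change in the data yields a new one.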

no comments
