Great post. For this pipeline I would probably have used a Makefile for the batch steps instead of Airflow, just to keep it simple. I would also make my sink a SQLite database so you can easily search through it with a web interface using Datasette.

For the places where bash was used I would just use Python, and any CLI tools I need to call I invoke with subprocess. It's much simpler, and I can run the scripts in a REPL and execute cells in Jupyter or plain PyCharm, so it's quick and interactive.

Love that you included something on building a data dictionary; I'm honestly guilty of not including a good data dictionary for the source data in the past. I would just leave the output of df.describe() or df.info() at the top of the Jupyter notebook where I restructure the source data before processing it. I now think you should build a data dictionary for both the source data and the final data and save it as a CSV, since that's more maintainable, or at least leave a comment in your script (rough sketch at the end of this comment).

Otherwise everything else is pretty similar to what I would do. I just went to my Google Takeout, and apparently all my Google Play data and songs are gone, so I guess I can't try this myself…
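For the data dictionary, something like this quick pandas sketch is what I mean (the file and column names are just placeholders, not from the article):

```python
# Rough sketch, not the author's code: dump a minimal data dictionary
# for a DataFrame to CSV so it lives next to the pipeline output.
# File and column names here are invented for illustration.
import pandas as pd

def write_data_dictionary(df: pd.DataFrame, path: str) -> None:
    dd = pd.DataFrame({
        "column": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "non_null": df.notna().sum().values,
        "example": [
            df[c].dropna().iloc[0] if df[c].notna().any() else ""
            for c in df.columns
        ],
        "description": "",  # fill these in by hand and keep them in version control
    })
    dd.to_csv(path, index=False)

source = pd.read_csv("gpm_tracks.csv")  # hypothetical source dump
write_data_dictionary(source, "gpm_tracks_dictionary.csv")
```

Point the same helper at the final DataFrame too, so both ends of the pipeline are documented.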
So are the MP3 files not the same as what the author uploaded? I could imagine weird organization for tracks that came from the service, but for self-uploaded data I'd be surprised if they didn't just hand the files back unchanged.

The article never mentions how any of this showed up in the GPM app itself, which feels like a gap.

Otherwise a nice article, but it reminds me why I gave up on media metadata organization long ago. So much work, so much mess...
> The script should be decently self-explanatory [...] Please note that this is all single-threaded, which I don’t recommend - with nohup and the like, you can trivially parallelize this.

How do you parallelize a loop in bash without all the echoes getting intertwined and jumbled together?