Great post. For this pipeline I would probably have used a Makefile for the batch steps instead of Airflow, just to keep it simple. I would also make my sink a SQLite database so you can easily search through it with a web interface using Datasette.

For the places where bash was used I would just use Python, and any CLI tools I need to call I invoke with subprocess. It's much simpler, and I can run the scripts in a REPL and execute cells in Jupyter or plain PyCharm, so it's quick and interactive.

Love that you included something on building a data dictionary; I'm honestly guilty of not including a good data dictionary for the source data in the past. I would just leave the output of df.describe() or df.info() at the top of the Jupyter notebook where I restructure the source data before processing it. I now think you should build a data dictionary for both the source data and the final data and save it as a CSV, since that's more maintainable, or at least leave a comment in your script (rough sketch at the end of this comment).

Otherwise everything else is pretty similar to what I would do. I just went to my Google Takeout, and apparently all my Google Play data and songs are gone, so I guess I can't try this myself…
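For the data dictionary, something like this quick pandas sketch is what I mean (the file and column names are just placeholders, not from the article):

```python
# Rough sketch, not the author's code: dump a minimal data dictionary
# for a DataFrame to CSV so it lives next to the pipeline output.
# File and column names here are invented for illustration.
import pandas as pd

def write_data_dictionary(df: pd.DataFrame, path: str) -> None:
    dd = pd.DataFrame({
        "column": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "non_null": df.notna().sum().values,
        "example": [
            df[c].dropna().iloc[0] if df[c].notna().any() else ""
            for c in df.columns
        ],
        "description": "",  # fill these in by hand and keep them in version control
    })
    dd.to_csv(path, index=False)

source = pd.read_csv("gpm_tracks.csv")  # hypothetical source dump
write_data_dictionary(source, "gpm_tracks_dictionary.csv")
```

Point the same helper at the final DataFrame too, so both ends of the pipeline are documented.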
So are the MP3 files not the same as what the author uploaded? I could imagine weird organization for tracks that came from the service, but for self-uploaded data I'd be surprised if they didn't just hand the files back unchanged.

The article never mentions how any of this showed up in the GPM app itself, which feels like a gap.

Otherwise a nice article, but it reminds me why I gave up on media metadata organization long ago. So much work, so much mess...
> The script should be decently self-explanatory [...] Please note that this is all single-threaded, which I don’t recommend - with nohup and the like, you can trivially parallelize this.

How do you parallelize a loop in bash without all the echoes getting intertwined and jumbled together?