Bad Data and Data Engineering: Dissecting Google Play Music Takeout Data

52 points by otter-in-a-suit over 3 years ago

3 comments

faizshah over 3 years ago
Great post. For this pipeline I would probably have used a Makefile for the batch pipeline instead of Airflow, just to keep it simple. I would also make my sink a SQLite database so that you can easily search through it with a web interface using datasette.

Where bash was used, I would just use Python and call any CLI tools through subprocess. It's much simpler, and I can run the scripts in a REPL, execute cells in Jupyter, or just use plain PyCharm, so it's quick and interactive.

Love that you included something on building a data dictionary. I am honestly guilty of not including a good data dictionary for the source data in the past. I would just leave the output of df.describe() or df.info() at the top of the Jupyter notebook where you restructure the source data before processing it. I now think you should save a data dictionary of the source data and the final data as a CSV, as it's more maintainable, or at least leave a comment in your script.

Otherwise everything else is pretty similar to what I would do. I just went to my Google Takeout and apparently all my Google Play data and songs are gone, so I guess I can't try this myself…
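To make the SQLite sink and CSV data dictionary ideas above a little more concrete, here is a minimal sketch using only the Python standard library. It is not the article's pipeline: the sample rows, the `tracks` table name, and the helper functions `build_data_dictionary`, `write_dictionary_csv`, and `load_into_sqlite` are all hypothetical, chosen only for illustration.

```python
import csv
import io
import sqlite3


def build_data_dictionary(rows):
    """Summarize each column: observed Python types, non-empty count, distinct count."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    summary = []
    for name in columns:
        values = [row.get(name) for row in rows]
        # Treat None and the empty string as missing values.
        present = [v for v in values if v not in (None, "")]
        summary.append({
            "column": name,
            "types": "|".join(sorted({type(v).__name__ for v in present})),
            "non_null": len(present),
            "distinct": len(set(present)),
        })
    return summary


def write_dictionary_csv(summary, handle):
    """Write the data dictionary as CSV to any file-like object."""
    writer = csv.DictWriter(
        handle, fieldnames=["column", "types", "non_null", "distinct"]
    )
    writer.writeheader()
    writer.writerows(summary)


def load_into_sqlite(rows, conn, table="tracks"):
    """Create an untyped table from the first row's keys and insert all rows."""
    columns = list(rows[0].keys())
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    conn.commit()


# Hypothetical sample rows standing in for parsed Takeout metadata.
rows = [
    {"title": "Song A", "artist": "Band X", "play_count": 3},
    {"title": "Song B", "artist": "", "play_count": 0},
    {"title": "Song C", "artist": "Band X", "play_count": 7},
]

summary = build_data_dictionary(rows)
buffer = io.StringIO()  # swap for open("data_dictionary.csv", "w", newline="")
write_dictionary_csv(summary, buffer)

conn = sqlite3.connect(":memory:")  # swap for a file path to keep the database
load_into_sqlite(rows, conn)
```

With a file path in place of `:memory:`, the resulting database file is the kind of sink that could then be browsed with datasette, and the CSV sits next to the script as a reviewable record of what the source data looked like.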
progbits over 3 years ago
So are the mp3 files not the same as what the author uploaded? I could imagine weird organization for tracks from the service, but for self-uploaded data I would be surprised if they didn't just give them back the same.

The article never mentioned how this showed up in the GPM app itself, which feels lacking.

Otherwise a nice article, but it reminds me why I long ago gave up on media metadata organization. So much work, so much mess...
wodenokoto over 3 years ago
> The script should be decently self-explanatory [...] Please note that this is all single-threaded, which I don’t recommend - with nohup and the like, you can trivially parallelize this.

How do you parallelize a loop in bash without getting all the echoes intertwined and jumbled together?
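One common way around interleaved output is to capture each job's output and print it only once the job has finished. The sketch below does this in Python rather than bash, so it is not the article's nohup approach. The `run_job` and `run_parallel` helpers and the sample commands are hypothetical, and only stdout is captured and shown.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(command):
    """Run one command and return its captured stdout instead of printing live."""
    result = subprocess.run(command, capture_output=True, text=True)
    return command, result.returncode, result.stdout


def run_parallel(commands, max_workers=4):
    """Run commands concurrently; print each job's output as one block."""
    outputs = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_job, cmd) for cmd in commands]
        for future in as_completed(futures):
            command, code, stdout = future.result()
            # Only this thread prints, one finished job at a time,
            # so lines from different jobs cannot interleave.
            print(f"--- {' '.join(command)} (exit {code})")
            print(stdout, end="")
            outputs.append(stdout)
    return outputs


# Hypothetical jobs: each prints two lines, standing in for one loop iteration.
commands = [
    [sys.executable, "-c", f"print('start {i}'); print('done {i}')"]
    for i in range(3)
]
outputs = run_parallel(commands)
```

Jobs may finish in any order, so the blocks can appear in a different sequence from run to run, but the lines within each block stay together.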