I came here expecting to read about the tech in the article or how others do big data processing. Instead I get off-topic Spotify rants. Did you read the article, or did you just see Spotify in the headline and decide your gripes were therefore relevant?
I'm having a hard time understanding this article. It seems a bit too low-level on the specifics of Beam for general consumption.<p>From what I understand, Spark has the same feature built in. If the planner knows that the source data is partitioned and/or sorted appropriately, it can skip shuffling/sorting it, instead having each executor directly request the one file it needs.<p>It's a nice optimization, but it's not game-changing. You often end up having to shuffle anyway, because you're joining on a different key, or for performance reasons you need more executors than the set number of partitions, or the shuffle needed to write the data doesn't justify the savings on the readers.<p>Maybe it's better with their additional optimizations? Spark mostly doesn't do those.
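For anyone who hasn't seen the Spark side of this, here's a rough sketch of the bucketed-table version (a sketch only; table/column names and the bucket count are made up, and whether the shuffle actually disappears depends on matching bucket counts on both sides - check the plan with explain()):

    import org.apache.spark.sql.SparkSession

    object BucketedJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("bucketed-join-sketch").getOrCreate()

        // One-time write: bucket and sort both datasets by the join key.
        spark.read.parquet("/data/plays")
          .write.bucketBy(256, "user_id").sortBy("user_id")
          .saveAsTable("plays_bucketed")
        spark.read.parquet("/data/users")
          .write.bucketBy(256, "user_id").sortBy("user_id")
          .saveAsTable("users_bucketed")

        // Later reads: with matching bucketing on both sides, the planner can
        // drop the Exchange (and sometimes the Sort) under the sort-merge join.
        val joined = spark.table("plays_bucketed")
          .join(spark.table("users_bucketed"), "user_id")
        joined.explain() // look for the absence of Exchange nodes
      }
    }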
I wonder if they could publish the dollar cost of that job before and after the optimization, as reported by GCP billing. I know it could be a bit unfair (some costs may be static regardless of job size, etc.), but it would improve decision making for others if discussions of public cloud usage optimizations also included the cost.
Regarding the title, how do we know this is THE largest Dataflow job? The article body itself only mentions that this is THEIR largest Dataflow job. This post doesn't make any quantifiable claim one could use to support that either; this is all I found:<p>> "We estimate around a 50% decrease in Dataflow costs this year compared to previous years’ Bigtable-based approach. Additionally, we avoided scaling the Bigtable cluster up two to three times its normal capacity (up to around 1,500 nodes at peak"<p>The official Spotify Engineering tweet similarly only says that this is Spotify's largest Dataflow job ever: <a href="https://twitter.com/SpotifyEng/status/1359887825047613442" rel="nofollow">https://twitter.com/SpotifyEng/status/1359887825047613442</a>.<p>I'm fairly sure a similar accidental unsourced exaggeration was made last year.<p>Maybe the title should be "Spotify Optimized Their Largest Dataflow Job Ever for Wrapped 2020"?
Hi, not sure if I'm just completely off here, but I'm wondering how this relates or compares to processing things with Kafka and Kafka Streams.<p>If I'm reading things correctly, the Kafka equivalent of the workflow in the article would be to have your producer write into some topic using the default key-hashing partitioner, keyed on the field you're interested in. Your consumer would then just read it; your data would already be grouped and ordered for the given keys (because Kafka guarantees ordering within a partition) and also be co-partitioned correctly if you need to read in some other topic with the same number of partitions and the same logical keys produced via the same algorithm. No?
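Roughly, yes - something like this with the Kafka Streams Scala API (a sketch only; the topic names are invented, both topics have to be keyed the same way and have the same partition count, and the Serdes import path differs between Kafka versions):

    import java.util.Properties
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.serialization.Serdes._ // pre-2.6: ...scala.Serdes._

    object CoPartitionedJoinSketch extends App {
      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "copartitioned-join-sketch")
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

      val builder = new StreamsBuilder()
      val plays    = builder.stream[String, String]("plays")         // keyed by userId
      val profiles = builder.table[String, String]("user-profiles")  // keyed by userId

      // With co-partitioned topics, each task joins its local partitions
      // directly; no extra repartition topic is needed.
      val enriched = plays.join(profiles)((play, profile) => s"$play|$profile")
      enriched.to("plays-enriched")

      new KafkaStreams(builder.build(), props).start()
    }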
Wrapped works so well because it panders to you. Everyone likes to be acknowledged for listening to the weird indie band they discovered earlier this year.<p>I enjoy it anyways, and Spotify is still a great service for now - I wonder if it'll meet the same fate as Netflix at some point, with publishing houses going for their own streaming services instead.
This article fails to make a clear problem statement for a general audience. It jumps right into jargon and names from some framework/library. It reads like an internal report from a programmer to their team that someone decided to make public with no changes.
For those that would like to dig more into what SMB is - here's the link to the paper from the article:
<a href="http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT01.pdf" rel="nofollow">http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...</a>
I'd be curious how this compares in load to Google's internal applications. I'm also curious how much of Google's infrastructure capacity goes to Google vs. GCE - has combined GCE usage even passed Google's internal compute needs yet?
This technique of using distributed storage for large joins instead of shuffling between compute nodes also helps make your job robust to spot instance kills. Until disaggregated shuffle services are widely adopted, it can be really handy.
Largest Dataflow job ever? I’m sure Google would beg to differ. At Quantcast we process 50PB every day, and that’s nothing compared to real scale like Google.<p>And merge joins from sorted data? Joins have been done that way since the punched-card days on mainframes (and by any data system at scale).
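For reference, the merge join itself is only a few lines once both inputs are sorted by key - roughly like this (illustrative names; duplicate keys are handled by emitting the cross product of each matching run):

    object MergeJoinSketch {
      // One forward pass over two inputs already sorted by key: no shuffle,
      // no hash table. Duplicate keys produce the cross product of their runs.
      def mergeJoin[A, B](left: Seq[(String, A)], right: Seq[(String, B)]): Seq[(String, (A, B))] = {
        val out = Seq.newBuilder[(String, (A, B))]
        var i = 0
        var j = 0
        while (i < left.length && j < right.length) {
          val k = left(i)._1
          val cmp = k.compareTo(right(j)._1)
          if (cmp < 0) i += 1
          else if (cmp > 0) j += 1
          else {
            val iEnd = left.indexWhere(_._1 != k, i) match { case -1 => left.length; case n => n }
            val jEnd = right.indexWhere(_._1 != k, j) match { case -1 => right.length; case n => n }
            for (x <- i until iEnd; y <- j until jEnd) out += ((k, (left(x)._2, right(y)._2)))
            i = iEnd
            j = jEnd
          }
        }
        out.result()
      }

      def main(args: Array[String]): Unit = {
        val plays = Seq("u1" -> "songA", "u1" -> "songB", "u3" -> "songC") // sorted by key
        val users = Seq("u1" -> "Alice", "u2" -> "Bob", "u3" -> "Cara")    // sorted by key
        mergeJoin(plays, users).foreach(println)
      }
    }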
Perhaps a bit off-topic, but a lot of users (myself included) reported wildly inaccurate data in Spotify Wrapped this year, with seemingly no explanation, no shared accounts, no re-used passwords, no weird listening history, etc.<p>I wonder if some of the data in the "We worked with the maintainer of these data sets to convert a year’s worth of data to SMB format." step got corrupted or just wrongly converted/lost.<p>I'm not sure how else to explain that I have to google artists in my top 10 because I've never heard of them.
ListenBrainz has gotten pretty good in the last while for keeping track of your music stats. It's got a nice weekly stats page, etc. It's also not ad-laden like last.fm.
Anyone wanna know how much Spotify wants to know about you?<p><a href="https://twitter.com/steipete/status/1025024813889478656" rel="nofollow">https://twitter.com/steipete/status/1025024813889478656</a>
Sorry if it's off-topic, but if anyone's interested, I'm launching volt.fm next week.<p>It connects to your Spotify account and generates a nice public page with your stats (top artists, top tracks), playlists, etc.<p>You can reserve your username now: <a href="https://volt.fm" rel="nofollow">https://volt.fm</a>
My off-topic rant: I really wish Spotify would focus on improving the core player experience. It has barely seen any improvements in years.<p>* Don't overwrite/delete my listening history every time I switch devices<p>* Allow tabs, or some way to resume what I've been listening to in different contexts<p>* Option to open only one instance, instead of having multiple instances that mess with each other<p>* Playing local files crashes/doesn't work on Linux<p>* Change playback speed, not just for podcasts<p>* Jump back/forward, not just for podcasts<p>* Have some visibility into when a song was last played / its play count<p>* Liked songs not always appearing in search results<p>* Sorting search results not working<p>* Add basic functionality to the dbus interface (e.g. seeking)<p>* Ability to report songs (e.g. wrong titles/badly split tracks/etc.)
Great! Now will they stop scanning my entire hard drive with their desktop app? Also stop opening sockets directly to advertiser IP addresses. And stop paying off data thieves instead of disclosing to their users that their passwords were leaked. And stop being sellouts too!
One thing I can never understand about Spotify is that despite its insane budget and huge number of employees/amount of talent, they still can't create better personalized playlists than either Pandora OR last.fm.<p>To this day, when I want a recommended playlist based on my taste/history, I always use last.fm because it's just plain better. Why? The "Discover" etc. playlists on Spotify are just crap.
Spotify feels like a company that wants to be a "big tech company" when in reality it doesn't need to. All it needs to do is provide a great service with as much music as possible.
When this report came out, it was the straw that broke the camel's back for me in terms of data privacy. Most people seem to have found Wrapped 2020 entertaining, but I found it creepy.<p>I miss being able to do something simple like listen to music or watch a movie without all my actions being recorded and saved. So I'm back to buying physical media and DRM-free downloads.<p>I'm convinced that it's now important to hold on to older appliances that work without internet access or data collection; this, plus right to repair, gives me hope for the future.