I came here expecting to read about the tech in the article or how others do big data processing. Instead I get off-topic Spotify rants. Did you read the article, or did you just see Spotify in the headline and decide your gripes were therefore relevant?
I'm having a hard time understanding this article. It seems a bit too low-level on the specifics of Beam for general consumption.<p>From what I understand, Spark has the same feature built in. If the planner knows that the source data is partitioned and/or sorted appropriately, it can skip shuffling/sorting it, instead having each executor directly request the one file it needs.<p>It's a nice optimization, but it's not game-changing. You often end up having to shuffle anyway, because you're joining on a different key, or for performance reasons you need more executors than the set number of partitions, or the shuffle needed to write the data doesn't justify the savings on the readers.<p>Maybe it's better with their additional optimizations? Spark mostly doesn't do those.
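For anyone who hasn't seen the Spark side of this, here's a rough sketch of the bucketed-table version (a sketch only; table/column names and the bucket count are made up, and whether the shuffle actually disappears depends on matching bucket counts on both sides - check the plan with explain()):

    import org.apache.spark.sql.SparkSession

    object BucketedJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("bucketed-join-sketch").getOrCreate()

        // One-time write: bucket and sort both datasets by the join key.
        spark.read.parquet("/data/plays")
          .write.bucketBy(256, "user_id").sortBy("user_id")
          .saveAsTable("plays_bucketed")
        spark.read.parquet("/data/users")
          .write.bucketBy(256, "user_id").sortBy("user_id")
          .saveAsTable("users_bucketed")

        // Later reads: with matching bucketing on both sides, the planner can
        // drop the Exchange (and sometimes the Sort) under the sort-merge join.
        val joined = spark.table("plays_bucketed")
          .join(spark.table("users_bucketed"), "user_id")
        joined.explain() // look for the absence of Exchange nodes
      }
    }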
I wonder if they could publish the dollar cost of that job before and after the optimization, as reported by GCP billing. I know it could be a bit unfair (some costs may be static regardless of job size, etc.), but it would improve decision making for others if discussions of public cloud usage optimizations also included the cost.
Regarding the title, how do we know this is THE largest Dataflow job? The article body itself only mentions that this is THEIR largest Dataflow job. This post doesn't make any quantifiable claim one could use to support that either; this is all I found:<p>> "We estimate around a 50% decrease in Dataflow costs this year compared to previous years’ Bigtable-based approach. Additionally, we avoided scaling the Bigtable cluster up two to three times its normal capacity (up to around 1,500 nodes at peak"<p>The official Spotify Engineering tweet similarly only says that this is Spotify's largest Dataflow job ever: <a href="https://twitter.com/SpotifyEng/status/1359887825047613442" rel="nofollow">https://twitter.com/SpotifyEng/status/1359887825047613442</a>.<p>I'm fairly sure a similar accidental unsourced exaggeration was made last year.<p>Maybe the title should be "Spotify Optimized Their Largest Dataflow Job Ever for Wrapped 2020"?
Hi, not sure if I'm just completely off here, but I'm wondering how this relates or compares to processing things with Kafka and Kafka Streams.<p>If I'm reading things correctly, the Kafka equivalent of the workflow in the article would be to have your producer write into some topic using the default key-hashing partitioner, keyed on the field you're interested in. Your consumer would then just read it; your data would already be grouped and ordered for the given keys (because Kafka guarantees ordering within a partition) and also be co-partitioned correctly if you need to read in some other topic with the same number of partitions and the same logical keys produced via the same algorithm. No?
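Roughly, yes - something like this with the Kafka Streams Scala API (a sketch only; the topic names are invented, both topics have to be keyed the same way and have the same partition count, and the Serdes import path differs between Kafka versions):

    import java.util.Properties
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.serialization.Serdes._ // pre-2.6: ...scala.Serdes._

    object CoPartitionedJoinSketch extends App {
      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "copartitioned-join-sketch")
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

      val builder = new StreamsBuilder()
      val plays    = builder.stream[String, String]("plays")         // keyed by userId
      val profiles = builder.table[String, String]("user-profiles")  // keyed by userId

      // With co-partitioned topics, each task joins its local partitions
      // directly; no extra repartition topic is needed.
      val enriched = plays.join(profiles)((play, profile) => s"$play|$profile")
      enriched.to("plays-enriched")

      new KafkaStreams(builder.build(), props).start()
    }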
Wrapped works so well because it panders to you. Everyone likes to be acknowledged for listening to the weird indie band they discovered earlier this year.<p>I enjoy it anyways, and Spotify is still a great service for now - I wonder if it'll meet the same fate as Netflix at some point, with publishing houses going for their own streaming services instead.
This article fails to make a clear problem statement for a general audience. It jumps right into jargon and names from some framework/library. It reads like an internal report from a programmer to their team that someone decided to make public with no changes.
For those that would like to dig more into what SMB is - here's the link to the paper from the article:
<a href="http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT01.pdf" rel="nofollow">http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...</a>
I'd be curious how this compares in load to Google's internal applications. I'm also curious how much of Google's infrastructure capacity goes to Google vs. GCE - has combined GCE usage even passed Google's internal compute needs yet?
This technique of using distributed storage for large joins instead of shuffling between compute nodes also helps make your job robust to spot instance kills. Until disaggregated shuffle services are widely adopted, it can be really handy.
Largest Dataflow job ever? I’m sure Google would beg to differ. At Quantcast we process 50PB every day, and that’s nothing compared to real scale like Google.<p>And merge joins from sorted data? Joins have been done that way since the punched-card days on mainframes (and by any data system at scale).
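For reference, the merge join itself is only a few lines once both inputs are sorted by key - roughly like this (illustrative names; duplicate keys are handled by emitting the cross product of each matching run):

    object MergeJoinSketch {
      // One forward pass over two inputs already sorted by key: no shuffle,
      // no hash table. Duplicate keys produce the cross product of their runs.
      def mergeJoin[A, B](left: Seq[(String, A)], right: Seq[(String, B)]): Seq[(String, (A, B))] = {
        val out = Seq.newBuilder[(String, (A, B))]
        var i = 0
        var j = 0
        while (i < left.length && j < right.length) {
          val k = left(i)._1
          val cmp = k.compareTo(right(j)._1)
          if (cmp < 0) i += 1
          else if (cmp > 0) j += 1
          else {
            val iEnd = left.indexWhere(_._1 != k, i) match { case -1 => left.length; case n => n }
            val jEnd = right.indexWhere(_._1 != k, j) match { case -1 => right.length; case n => n }
            for (x <- i until iEnd; y <- j until jEnd) out += ((k, (left(x)._2, right(y)._2)))
            i = iEnd
            j = jEnd
          }
        }
        out.result()
      }

      def main(args: Array[String]): Unit = {
        val plays = Seq("u1" -> "songA", "u1" -> "songB", "u3" -> "songC") // sorted by key
        val users = Seq("u1" -> "Alice", "u2" -> "Bob", "u3" -> "Cara")    // sorted by key
        mergeJoin(plays, users).foreach(println)
      }
    }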
Perhaps a bit off-topic, but a lot of users (myself included) reported wildly inaccurate data in Spotify Wrapped this year, with seemingly no explanation, no shared accounts, no re-used passwords, no weird listening history, etc.<p>I wonder if some of the data in the "We worked with the maintainer of these data sets to convert a year’s worth of data to SMB format." step got corrupted or just wrongly converted/lost.<p>I'm not sure how else to explain that I have to google artists in my top 10 because I've never heard of them.
ListenBrainz has gotten pretty good in the last while for keeping track of your music stats. It's got a nice weekly stats page, etc. It's also not ad-laden like last.fm.
Anyone wanna know how much Spotify wants to know about you?<p><a href="https://twitter.com/steipete/status/1025024813889478656" rel="nofollow">https://twitter.com/steipete/status/1025024813889478656</a>
Sorry if it's off-topic, but if anyone's interested, I'm launching volt.fm next week.<p>It connects to your Spotify account and generates a nice public page with your stats (top artists, top tracks), playlists, etc.<p>You can reserve your username now: <a href="https://volt.fm" rel="nofollow">https://volt.fm</a>
My off-topic rant: I really wish Spotify would focus on improving the core player experience. It has barely seen any improvements in years.<p>* Don't overwrite/delete my listening history every time I switch devices<p>* Allow tabs, or some way to resume what I've been listening to in different contexts<p>* Option to open only one instance, instead of having multiple instances that mess with each other<p>* Playing local files crashes/doesn't work on Linux<p>* Change playback speed, not just for podcasts<p>* Jump back/forward, not just for podcasts<p>* Have some visibility into when a song was last played / its play count<p>* Liked songs not always appearing in search results<p>* Sorting search results not working<p>* Add basic functionality to the dbus interface (e.g. seeking)<p>* Ability to report songs (e.g. wrong titles/badly split tracks/etc.)
Great! Now will they stop scanning my entire hard drive with their desktop app? Also stop opening sockets directly to advertiser IP addresses. And stop paying off data thieves instead of disclosing to their users that their passwords were leaked. And stop being sellouts too!
One thing I can never understand about Spotify is that despite its insane budget and huge number of employees/amount of talent, they still can't create better personalized playlists than either Pandora OR last.fm.<p>To this day, when I want a recommended playlist based on my taste/history, I always use last.fm because it's just plain better. Why? The "Discover" etc. playlists on Spotify are just crap.
Spotify feels like a company that wants to be a "big tech company" when in reality it doesn't need to. All it needs to do is provide a great service with as much music as possible.
When this report came out, it was the straw that broke the camel's back for me in terms of data privacy. Most people seem to have found Wrapped 2020 entertaining, but I found it creepy.<p>I miss being able to do something simple like listen to music or watch a movie without all my actions being recorded and saved. So I'm back to buying physical media and DRM-free downloads.<p>I'm convinced that it's now important to hold on to older appliances that work without internet access or data collection; this, plus right to repair, gives me hope for the future.