I've been following Uber's big data platform engineering for a while, and this is a really interesting update. Specifically, it's interesting how well their Gen 3 stack held up. It's also an interesting choice to solve the incremental update problem at storage time instead of inserting another upstream ETL process (which I'm sure would be incredibly expensive at this scale).<p>Also interesting: at a lot of companies, you look at their big data ecosystem and it's littered with tons of tools. Uber seems to have always done a good job keeping that pared down, which tells me their team knows what they're doing.
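If I'm reading the post right, "at storage time" means upserting changed rows directly into the table format (Hudi) rather than re-running a bulk ETL upstream. A rough sketch of what that pattern looks like with the Hudi Spark datasource, using made-up table/field names (option names also vary a bit between Hudi versions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

    # A small batch of changed rows; in practice this would come from a changelog
    updates = spark.createDataFrame(
        [("trip-123", "completed", "2018-10-01 12:00:00")],
        ["trip_id", "status", "updated_at"],
    )

    # Upsert into the existing table in place -- only the affected files get rewritten
    (updates.write
        .format("hudi")
        .option("hoodie.table.name", "trips")
        .option("hoodie.datasource.write.recordkey.field", "trip_id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("hdfs:///warehouse/trips/"))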
I find it interesting that one of their major pain points was data schema. Having worked at places that use plain JSON and places that use protobuf, I can highly recommend that anyone starting an even mildly complex data engineering project (complexity in the data or in the number of stakeholders) use something like protobuf, Apache Arrow, or a columnar format if you need it.<p>Having a clearly defined schema that can be shared between teams (we had a dedicated repo for all protobuf definitions with enforced pull requests) significantly reduces the amount of headaches down the road.
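For anyone starting out, the Arrow/Parquet version of a clearly defined schema is only a few lines; here's a minimal sketch with made-up field names:

    from datetime import datetime

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Explicit, shareable schema -- the columnar analogue of a checked-in .proto file
    trip_schema = pa.schema([
        pa.field("trip_id", pa.string(), nullable=False),
        pa.field("rider_id", pa.string(), nullable=False),
        pa.field("fare_cents", pa.int64()),
        pa.field("requested_at", pa.timestamp("ms")),
    ])

    table = pa.Table.from_pydict(
        {
            "trip_id": ["t-1"],
            "rider_id": ["r-9"],
            "fare_cents": [1250],
            "requested_at": [datetime(2018, 10, 1, 12, 0)],
        },
        schema=trip_schema,
    )

    pq.write_table(table, "trips.parquet")  # writer and readers both see the same schema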
I was wondering how Uber could possibly need 100 PB of space, but if you consider that they've served roughly 10 billion rides, it actually comes out to roughly 10 megabytes per ride.
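Back-of-the-envelope, in decimal units:

    bytes_total = 100 * 10**15      # 100 PB
    rides = 10 * 10**9              # ~10 billion rides
    print(bytes_total / rides)      # 1e7 bytes, i.e. roughly 10 MB per ride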
Good post. The snapshot-based approach at ingestion time was the part I couldn't figure out: why was it considered a good decision at implementation time?<p>I've experimented with Parquet data on S3 for a work POC, and the latency to fetch the data, create tables, and run the Spark SQL query (on an EMR cluster) was quite noticeable. I was advised that EMRFS would make it run quicker, but I never got around to playing with that. But I guess creating in-memory tables from raw data snapshots would still hold true? Or maybe I missed something.<p>Also, I take it that if 24 hours is the latency requirement from ingestion to availability of this data, this obviously isn't the data platform powering the real-time booking/sharing of Uber rides. I'd be curious to see what data pipeline powers that at Uber.
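For reference, the pattern I was testing was roughly the following (bucket, paths, and table names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-poc").getOrCreate()

    # Read raw Parquet snapshots straight from S3
    trips = spark.read.parquet("s3://example-bucket/trips/")

    # Register an in-memory view so Spark SQL can query it
    trips.createOrReplaceTempView("trips")

    spark.sql("""
        SELECT city_id, COUNT(*) AS trip_count
        FROM trips
        GROUP BY city_id
    """).show()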
Reading this, I can’t help but think Uber would be better off adopting one of the commercial data warehouses that separate compute from storage: Snowflake or BigQuery. They have full support for updates, they support huge scale, and because they’re more efficient the cost is comparable to Presto in spite of the vendor margin. You can ingest huge quantities of updates if you batch them up correctly, and there are commercial tools that will do the entire ingest for you (<i>cough</i> Fivetran).<p>Disclosure: am CEO of Fivetran.
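To make "batch them up correctly" concrete, here's a minimal sketch of the staging-table-plus-MERGE pattern on BigQuery via its Python client (the dataset and table names are hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Load each batch of changed rows into a staging table first, then fold
    # the whole batch into the target table with a single MERGE statement.
    merge_sql = """
    MERGE `analytics.trips` AS t
    USING `analytics.trips_staging` AS s
    ON t.trip_id = s.trip_id
    WHEN MATCHED THEN
      UPDATE SET status = s.status, updated_at = s.updated_at
    WHEN NOT MATCHED THEN
      INSERT (trip_id, status, updated_at)
      VALUES (s.trip_id, s.status, s.updated_at)
    """

    client.query(merge_sql).result()  # one DML job per batch instead of row-by-row updates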
I wonder which BI tools they use for running ad-hoc queries on their Presto cluster. User behavioral analytics is a hassle when you do it in raw SQL, and generic BI solutions don't help with that.<p>Also, I assume they have dashboards that use pre-aggregated tables for faster results; they probably have ETL jobs for this use case, but is the pre-aggregated data stored on HDFS as well?
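What I mean by pre-aggregation, roughly, is an ETL job along these lines (table and path names are made up), with the rollup written back out for dashboards to hit instead of the raw table:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

    trips = spark.read.parquet("hdfs:///warehouse/trips/")

    # Daily per-city rollup that dashboards can query instead of the raw table
    daily = (trips
             .groupBy("city_id", F.to_date("requested_at").alias("day"))
             .agg(F.count("*").alias("trip_count"),
                  F.avg("fare").alias("avg_fare")))

    daily.write.mode("overwrite").parquet("hdfs:///warehouse/trips_daily_agg/")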
After their data problem exceeded a single MySQL instance - hypothetically, what would have happened if they had switched to Google Cloud Spanner? Ostensibly Google has a lot more than 100 petabytes in Spanner. Could you still run basic queries on it without switching to HBase?
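For what it's worth, Spanner does speak plain SQL, so a basic query through its Python client looks like this (instance, database, and table names are hypothetical):

    from google.cloud import spanner

    client = spanner.Client()
    instance = client.instance("example-instance")
    database = instance.database("rides")

    # Ordinary SQL read; no HBase-style key-value access patterns required
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            "SELECT city_id, COUNT(*) FROM trips GROUP BY city_id"
        )
        for row in rows:
            print(row)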
Interesting that they use the term "driver-partner" in some parts but just "driver" in others.<p>I guess they want to avoid liability as much as possible?<p>Would it really be possible to use a blog post in a legal proceeding to determine whether Uber has drivers or partners?