I've been following Uber's big data platform engineering for a while, and this is a really interesting update. Specifically, it's interesting how well their Gen 3 stack held up. It's also an interesting choice to solve the incremental update problem at storage time instead of inserting another upstream ETL process (which I'm sure would be incredibly expensive at this scale).<p>Also interesting: at a lot of companies, you look at their big data ecosystem and it's littered with tons of tools. Uber seems to have always done a good job keeping that pared down, which tells me their team knows what they're doing.
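If I'm reading the post right, "at storage time" means upserting changed rows directly into the table format (Hudi) rather than re-running a bulk ETL upstream. A rough sketch of what that pattern looks like with the Hudi Spark datasource, using made-up table/field names (option names also vary a bit between Hudi versions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

    # A small batch of changed rows; in practice this would come from a changelog
    updates = spark.createDataFrame(
        [("trip-123", "completed", "2018-10-01 12:00:00")],
        ["trip_id", "status", "updated_at"],
    )

    # Upsert into the existing table in place -- only the affected files get rewritten
    (updates.write
        .format("hudi")
        .option("hoodie.table.name", "trips")
        .option("hoodie.datasource.write.recordkey.field", "trip_id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("hdfs:///warehouse/trips/"))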
I find it interesting that one of their major pain points was data schema. Having worked at places that use plain JSON and places that use protobuf, I can highly recommend that anyone starting an even mildly complex data engineering project (complexity in the data or in the number of stakeholders) use something like protobuf, Apache Arrow, or a columnar format if you need it.<p>Having a clearly defined schema that can be shared between teams (we had a dedicated repo for all protobuf definitions with enforced pull requests) significantly reduces the amount of headaches down the road.
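For anyone starting out, the Arrow/Parquet version of a clearly defined schema is only a few lines; here's a minimal sketch with made-up field names:

    from datetime import datetime

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Explicit, shareable schema -- the columnar analogue of a checked-in .proto file
    trip_schema = pa.schema([
        pa.field("trip_id", pa.string(), nullable=False),
        pa.field("rider_id", pa.string(), nullable=False),
        pa.field("fare_cents", pa.int64()),
        pa.field("requested_at", pa.timestamp("ms")),
    ])

    table = pa.Table.from_pydict(
        {
            "trip_id": ["t-1"],
            "rider_id": ["r-9"],
            "fare_cents": [1250],
            "requested_at": [datetime(2018, 10, 1, 12, 0)],
        },
        schema=trip_schema,
    )

    pq.write_table(table, "trips.parquet")  # writer and readers both see the same schema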
I was wondering how Uber could possibly need 100 PB of space, but if you consider that they've served roughly 10 billion rides, it actually comes out to roughly 10 megabytes per ride.
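Back-of-the-envelope, in decimal units:

    bytes_total = 100 * 10**15      # 100 PB
    rides = 10 * 10**9              # ~10 billion rides
    print(bytes_total / rides)      # 1e7 bytes, i.e. roughly 10 MB per ride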
Good post. The snapshot-based approach at ingestion time was the part I couldn't figure out: why was it considered a good decision at implementation time?<p>I've experimented with Parquet data on S3 for a work POC, and the latency to fetch the data, create tables, and run the Spark SQL query (on an EMR cluster) was quite noticeable. I was advised that EMRFS would make it run quicker, but I never got around to playing with that. But I guess creating in-memory tables from raw data snapshots would still hold true? Or maybe I missed something.<p>Also, I take it that if 24 hours is the latency requirement from ingestion to availability of this data, this obviously isn't the data platform powering the real-time booking/sharing of Uber rides. I'd be curious to see what data pipeline powers that at Uber.
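For reference, the pattern I was testing was roughly the following (bucket, paths, and table names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-poc").getOrCreate()

    # Read raw Parquet snapshots straight from S3
    trips = spark.read.parquet("s3://example-bucket/trips/")

    # Register an in-memory view so Spark SQL can query it
    trips.createOrReplaceTempView("trips")

    spark.sql("""
        SELECT city_id, COUNT(*) AS trip_count
        FROM trips
        GROUP BY city_id
    """).show()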
Reading this, I can’t help but think Uber would be better off adopting one of the commercial data warehouses that separate compute from storage: Snowflake or BigQuery. They have full support for updates, they support huge scale, and because they’re more efficient the cost is comparable to Presto in spite of the vendor margin. You can ingest huge quantities of updates if you batch them up correctly, and there are commercial tools that will do the entire ingest for you (<i>cough</i> Fivetran).<p>Disclosure: am CEO of Fivetran.
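To make "batch them up correctly" concrete, here's a minimal sketch of the staging-table-plus-MERGE pattern on BigQuery via its Python client (the dataset and table names are hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Load each batch of changed rows into a staging table first, then fold
    # the whole batch into the target table with a single MERGE statement.
    merge_sql = """
    MERGE `analytics.trips` AS t
    USING `analytics.trips_staging` AS s
    ON t.trip_id = s.trip_id
    WHEN MATCHED THEN
      UPDATE SET status = s.status, updated_at = s.updated_at
    WHEN NOT MATCHED THEN
      INSERT (trip_id, status, updated_at)
      VALUES (s.trip_id, s.status, s.updated_at)
    """

    client.query(merge_sql).result()  # one DML job per batch instead of row-by-row updates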
I wonder which BI tools they use for running ad-hoc queries on their Presto cluster. User behavioral analytics is a hassle when you do it in raw SQL, and generic BI solutions don't help with that.<p>Also, I assume they have dashboards that use pre-aggregated tables for faster results; they probably have ETL jobs for this use case, but is the pre-aggregated data stored on HDFS as well?
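What I mean by pre-aggregation, roughly, is an ETL job along these lines (table and path names are made up), with the rollup written back out for dashboards to hit instead of the raw table:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

    trips = spark.read.parquet("hdfs:///warehouse/trips/")

    # Daily per-city rollup that dashboards can query instead of the raw table
    daily = (trips
             .groupBy("city_id", F.to_date("requested_at").alias("day"))
             .agg(F.count("*").alias("trip_count"),
                  F.avg("fare").alias("avg_fare")))

    daily.write.mode("overwrite").parquet("hdfs:///warehouse/trips_daily_agg/")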
After their data problem exceeded a single MySQL instance - hypothetically, what would have happened if they had switched to Google Cloud Spanner? Ostensibly Google has a lot more than 100 petabytes in Spanner. Could you still run basic queries on it without switching to HBase?
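For what it's worth, Spanner does speak plain SQL, so a basic query through its Python client looks like this (instance, database, and table names are hypothetical):

    from google.cloud import spanner

    client = spanner.Client()
    instance = client.instance("example-instance")
    database = instance.database("rides")

    # Ordinary SQL read; no HBase-style key-value access patterns required
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            "SELECT city_id, COUNT(*) FROM trips GROUP BY city_id"
        )
        for row in rows:
            print(row)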
Interesting that they use the term "driver-partner" in some parts but just "driver" in others.<p>I guess they want to avoid liability as much as possible?<p>Would it really be possible to use a blog post in a legal proceeding to determine whether Uber has drivers or partners?