
Data Wrangling at Slack

118 points by dianamp over 8 years ago

15 comments

ransom1538 over 8 years ago
For what it's worth, every company I have worked for, and almost every company I know, builds its own bizarre stats system. At each presentation I attend (the last one being Uber's), the ideas for storing columnar data get even nuttier. Frankly, I gave up. I just installed New Relic Insights, and now I can run queries, build dashboards, and scale infinitely. I understand that Slack has scale, but why on earth hook together 30 random technologies and become an analytics company too?
coldcode over 8 years ago
Sometimes, looking at people's stacks, I wonder if we've made computing so complicated that most of our time is spent dealing with stuff that is broken, and little is left to do anything useful. Data science seems even more prone to this than programming in general, and sometimes you wonder if the result is actually worth all the pain.
bhntr3 over 8 years ago
Seems like a pretty typical set of problems. Dependency conflicts: hard. Schema evolution: hard. Upgrades: hard.

The big data space still feels like an overengineered, fractured, buggy mess to me. I was hoping Spark would simplify the user experience, but it's as much of a clusterf*ck as anything else.

How hard can fast, reliable distributed computation and storage for petabytes of data be? He said ironically.
buremba over 8 years ago
We actually have a pretty similar architecture at https://rakam.io: we use Presto for ad-hoc analysis, Avro for hot data, and ORC as columnar storage. Similar to Slack, we have an append-only schema (stored in MySQL instead of Hive). Since Avro preserves field ordering, the parser uses the latest schema, and if it hits EOF in the middle of the buffer, it fills the unread columns with null. We modified the Presto engine and built a real-time data warehouse: Avro is used when pushing data to Kafka, and the consumers fetch the data in micro-batches, process it, convert it to ORC format, and save it to both local SSD and AWS S3.
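The append-only trick described here can be sketched as follows. This is a hypothetical illustration of the idea, not Avro's actual binary encoding: fields are serialized in schema order, so a reader holding a *newer* schema can parse a record written under an older one and fill the missing trailing fields with null once it hits EOF. Field names and the length-prefixed encoding are made up for the example.

```python
import io
import struct

def encode_record(values):
    """Length-prefix each UTF-8 value, written in schema order."""
    buf = io.BytesIO()
    for v in values:
        data = v.encode("utf-8")
        buf.write(struct.pack(">I", len(data)))
        buf.write(data)
    return buf.getvalue()

def decode_record(payload, schema_fields):
    """Decode with the latest schema; fields past EOF become None."""
    buf = io.BytesIO(payload)
    record = {}
    for name in schema_fields:
        header = buf.read(4)
        if len(header) < 4:          # EOF: this record predates the field
            record[name] = None
            continue
        (length,) = struct.unpack(">I", header)
        record[name] = buf.read(length).decode("utf-8")
    return record

# A record written under an old two-field schema...
old_payload = encode_record(["evt_123", "click"])
# ...read back with a newer three-field schema.
print(decode_record(old_payload, ["id", "event_type", "country"]))
# {'id': 'evt_123', 'event_type': 'click', 'country': None}
```

The same property is why the schema must be append-only: inserting or reordering fields would desynchronize old payloads from the latest schema.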
zaptheimpaler over 8 years ago
I had a very similar experience with Parquet and cross-system pains. Pretty much the whole big data space is a giant cluster fuck of poorly documented and ever-so-slightly incompatible technologies, with hidden config flags you need to find to get things working the way you want, classpath issues, tiny incompatibilities between data storage formats and SQL dialects, and so on.

Hoping someone on this thread could answer a related question: how do you store data in Parquet when the schema is not known ahead of time? Currently we create an RDD and use Spark to save as Parquet (which I believe has an encoder/decoder for Rows), but this is a problem because we can't stream each record as it comes, and we use a lot of memory to buffer before writing to disk.
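The usual workaround for the buffering problem is to flush bounded micro-batches ("row groups") as they fill, rather than materializing the whole dataset first; columnar writers such as pyarrow's ParquetWriter expose this pattern. A stdlib-only sketch of the batching idea, with JSON Lines standing in for the real columnar format and an invented batch size:

```python
import json
import os
import tempfile

class BatchingWriter:
    """Buffer at most batch_size records, then flush them to disk."""

    def __init__(self, path, batch_size=3):
        self.path = path
        self.batch_size = batch_size
        self.batch = []
        self.flushes = 0

    def write(self, record):
        self.batch.append(record)
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.batch:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            for record in self.batch:
                f.write(json.dumps(record) + "\n")
        self.batch = []          # memory stays bounded by batch_size
        self.flushes += 1

fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
writer = BatchingWriter(path, batch_size=3)
for i in range(7):
    writer.write({"id": i})
writer.flush()                   # flush the final partial batch
print(writer.flushes)            # 3 flushes: 3 + 3 + 1 records
```

This bounds memory to one batch regardless of stream length; it does not solve the unknown-schema half of the question, which generally requires inferring or evolving the schema per batch.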
mastratton3 over 8 years ago
We're actually having a debate now, as we start to process larger datasets, about whether we should keep everything on S3 or start using HDFS with Hive. I'm curious whether you considered HDFS, why you decided to go strictly with S3, and whether there are any issues you've encountered with S3.
vikiomega9 over 8 years ago
I'm curious how much time is spent moving data back and forth from S3. It sounds like they don't currently have an ETL per se.
Plough_Jogger over 8 years ago
We are implementing a very similar architecture and have decided to use Avro for schema validation/serialization rather than Parquet.

Does anyone have experience with both who can speak to their strengths and weaknesses?
eng_monkey over 8 years ago
Data engineering is about developing technology for data management. Data management/analysis is about using that technology to produce results.

So this is not about data engineering, but about data management/analysis.
dangoldin over 8 years ago
We (adtech) use a very similar approach. We're consuming a ton of data through Kafka and then using Secor to store it on S3 as Parquet files. We then use Spark for both aggregations and ad-hoc analyses.

One thing that sounds very interesting, and worked surprisingly well when I played around with it, is Amazon's Athena (https://aws.amazon.com/athena/), which lets you query Parquet data directly without relying on Spark, which can get expensive quickly. I wouldn't trust it with production use cases just yet, and it ties you further into the AWS ecosystem, but it might be worth exploring as a simple way to do basic queries on top of Parquet data. I suspect it's simply a managed service on top of Apache Drill (https://drill.apache.org/).
v0g0n over 8 years ago
With Qubole you can offload data engineering to their platform. Cluster management is super simple. Hand-rolled solutions, in my experience, are a pain, and elastic cloud features take time to build. Qubole's offering provides an out-of-the-box experience for most big data engines out there: Presto, Spark, Hive, Pig, what have you, all working with your data living in S3 (or any other object storage). I believe they have offerings in other clouds too.

Some amount of S3 listing optimization has been done by Qubole's engineering team: https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/

They also have features that auto-provision additional capacity in your compute clusters as your query processing times increase.
poorman over 8 years ago
Apparently the concept of sampling has been lost to time.
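The point about sampling is concrete: many ad-hoc analytics questions never need the full dataset. Reservoir sampling, for instance, keeps a uniform k-item sample of an arbitrarily large stream in O(k) memory. A minimal sketch (stream contents and k are illustrative):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniformly sample k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# 10 events sampled from a million-element "stream" in constant memory.
sample = reservoir_sample(range(1_000_000), 10, random.Random(42))
print(len(sample))  # 10
```

Each item ends up in the final sample with probability exactly k/n, which is often all an exploratory query needs.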
henrygrew over 8 years ago
Isn't moving data back and forth from S3 rather expensive?
OskarS over 8 years ago
This is off-topic, but I can't help myself: Slack, Hive, Presto, Spark, Sqooper, Kafka, Secor, Thrift, Parquet.

I sometimes can't tell the difference between real Silicon Valley product names and parodies. I'm starting to miss the days when it was all just letters and numbers.
vs2370 over 8 years ago
Well, for what it's worth, my experience interviewing for the data team there was terrible: a long coding exercise that, when submitted, resulted in a seven-day wait and a two-line email. Wouldn't recommend.