
Build your own “data lake” for reporting purposes

143 points by Signez about 4 years ago

11 comments

pupdogg about 4 years ago
This is an overly complex solution to a problem we solved with a simple VPS running ClickHouse as the backend and Grafana for the frontend. Our production DB is an Aurora MySQL instance, and we keep it lean by performing daily dumps of reporting-related data into a gzip-compressed CSV -> pushing it to S3 -> converting it to Parquet with AWS Glue -> bringing it into ClickHouse. Data size for these specific reports is approx. 100k rows daily, partitioned by MONTH/YEAR. Overall cost: $20/month for the VPS and approx. $15/month in AWS billing.
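A minimal sketch of the nightly dump-and-load step described above, assuming pyarrow and boto3 and doing the CSV-to-Parquet conversion locally rather than with AWS Glue; the file names, bucket, and key layout are hypothetical:

```python
# Sketch: gzipped CSV dump -> Parquet -> S3, roughly the pipeline described above.
# Assumes pyarrow and boto3; file, bucket, and key names are made up for illustration.
import gzip

import boto3
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

DUMP = "reporting_2021-03-15.csv.gz"       # hypothetical daily dump from the production DB
PARQUET = "reporting_2021-03-15.parquet"
BUCKET = "example-reporting-lake"          # hypothetical S3 bucket

# Read the compressed dump into an Arrow table and rewrite it as columnar Parquet.
with gzip.open(DUMP, "rb") as f:
    table = pacsv.read_csv(f)
pq.write_table(table, PARQUET, compression="snappy")

# Push the Parquet file to S3, keyed by month/year so ClickHouse can ingest it
# into a table partitioned the same way.
boto3.client("s3").upload_file(PARQUET, BUCKET, f"reports/year=2021/month=03/{PARQUET}")
```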
georgewfraser about 4 years ago
This is a needlessly complex solution. You will get better performance, with simpler maintenance, by replicating everything into an appropriate analytical database (Snowflake and BigQuery are both good choices). Setting up multiple Postgres databases and linking them together with foreign data wrappers is interesting to blog about, but it's an extremely roundabout way to solve this problem.
xyzzy_plugh about 4 years ago
This is mostly what I would do at a small- to medium-sized startup. Everything seems really sane.

There are two potential flaws, though.

One is that while using a read replica for reporting works, there be dragons. There is a reasonable amount of tuning possible here. It's important not to conflate the reporting replica with a standby replica for prod. You also might consider asynchronous replication over synchronous replication for the reporting DB. Lastly, there are (tunable) limits to how much you can get away with: long-running queries on either end can cause queries to get cancelled if they hold up replication enough. In other words, it is still very valuable to have optimized, small queries on the reporting DB where possible.

Second is that this works fine until your data doesn't fit on an instance. Large RDS instances are *not cheap*, and at some large table size it begins to make sense to look at warehouse solutions, like Spark on S3/whatever, Redshift, or Snowflake, which can scale out to match your data size. I'd be concerned to find a rats' nest of reporting database instances where a proper cluster should be used.
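For the replica-tuning point, the relevant knobs are standard Postgres settings (hot_standby_feedback, max_standby_streaming_delay); a minimal sketch of applying them on the reporting replica with psycopg2, with the connection details hypothetical and the values workload-dependent:

```python
# Sketch: tune a reporting replica so long analytical queries are less likely to be
# cancelled by replication conflicts. The settings are real Postgres GUCs; the
# connection string is hypothetical.
import psycopg2

conn = psycopg2.connect("host=reporting-replica.internal dbname=postgres user=admin")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block

with conn.cursor() as cur:
    # Ask the primary to keep rows the replica's queries still need (can increase primary bloat).
    cur.execute("ALTER SYSTEM SET hot_standby_feedback = on")
    # Let queries on the replica delay WAL replay for up to two minutes before being cancelled.
    cur.execute("ALTER SYSTEM SET max_standby_streaming_delay = '120s'")
    cur.execute("SELECT pg_reload_conf()")
conn.close()
```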
snidane about 4 years ago
It seems the article suggests piping around and querying tabular data in Postgres. I don't see a need for the data lake part.

Data warehouses are used to store and process anything that looks like a structured table, or nested tables, the so-called semi-structured data.

Data lakes are for everything else. Your SQL warehouse can't process, or is not ergonomic for processing:

  - purchased data, shared as a zip archive of CSV files
  - 10-level-deep nested JSON files coming from API calls
  - HTML files from web scrapes
  - contents of an FTP server containing 100s of various CSV and other files
  - array data used for machine learning algorithms
  - pickled Python ML models
  - YAML configs
  - PDF documents such as data dictionaries
  - materialized views over raw data

Besides your DWH, you need to have a storage layer where you store these files and raw data. This is the main reason why companies have data lake projects. Without some centralised oversight and discipline it only results in a big mess. Note that the centralisation doesn't have to be company-wide; each team or department can maintain their own data lake standard. The more centralised you make it, the more economies of scale you get, but the harder it gets to enforce and maintain, with a higher chance of turning into a mess again.

I think the Data Mesh concept proposes structuring your org as a bunch of data-producing teams, each maintaining their own data lake, instead of having one huge-ass lake mess in the middle.

Tools like delta.io and Databricks are giving data lakes the full capabilities of a data warehouse, so the difference between a data lake and a data warehouse is diminishing. These days you can get away without a dedicated DWH and just store everything in a blob store and plug in short-lived processing engines as you wish, without vendor lock-in.
gizmodo59 about 4 years ago
For my home projects I generate Parquet files (columnar and well suited for DW-like queries) with pyarrow and use Dremio (direct SQL on the data lake): https://github.com/dremio/dremio-oss (https://www.dremio.com/on-prem/) to query them (MinIO, local disk, or S3), and use Apache Superset for quick charts or dashboards.
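As a rough illustration of the Parquet-generation half of that setup (Dremio and Superset are separate services), assuming pyarrow; the table contents and output path are made up:

```python
# Sketch: write a small columnar Parquet file with pyarrow, which Dremio (or any
# Parquet-aware engine) can then query from local disk, MinIO, or S3.
# The data and output path are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2021-03-14", "2021-03-15"],
    "page": ["/home", "/pricing"],
    "views": [1520, 342],
})
pq.write_table(table, "events.parquet", compression="snappy")
```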
greenbcj about 4 years ago
This seems more like “build your own data warehouse” than a data lake.
throwaway346434 about 4 years ago
I feel some of the negative comments miss the point, at least about structuring reporting extracts and presenting them in an easy-to-maintain way for services: with an approach like this, you are signing a contract with the data team and taking on their concerns. Your unit tests fail if you are about to break the contract with a change, and you discover it up front, rather than after rolling out a new service that changes a definition unexpectedly.
unixhero about 4 years ago
Thanks for giving back to the community in the form of these meditations on Postgres, real-world business-scenario analytics, and lessons learned.
0xbadcafebee about 4 years ago
> When the team was small and had to grow fast, no one in the tech team took the time to build, or even think of, a data processing platform.

Good! This is much better than building something well before you need it. So when do you know you'll need it? By having each team track trends in metrics for their services/work and predicting when limits will be exceeded. (This used to be mandatory back when we had to buy gear a year or two ahead.) This catches persistently increasing data sizes as well as other issues. Each team should track their metrics, write a description of what each metric's effect means for the whole product, and forward them to a product owner (along with estimates of when limits will be exceeded).

> do try to set up some streaming replication on a dedicated reporting database server for quasi-live data (or set up automated regular dump & restore process if you don't need live data)

Both of these are potentially hazardous to performance, security, regulatory requirements, customer contract requirements, etc. Replication can literally halt your production database, as can ETL, and where the data goes and how it's managed is just as important as for the production DB. So look at your particular use case and design the solution just as carefully as if you were giving direct access to production.

As for the structure of it all, once you start adding outside data, you'll find it's easier to architect your solution to have multiple tiers of extracted data in different pipelines, and to expose each of them to analysis independently. You can make all of them eventually land in a dedicated database, but allowing individual analysis lets you later build purpose-fit solutions for just a subset of the data without having to constantly mutate one "end state" database (which may become vastly more complex than your production database).

Btw, the name for this kind of work is Business Intelligence (https://en.m.wikipedia.org/wiki/Business_intelligence).
revskill about 4 years ago
All of this over-engineering comes from the lack of a good ORM model to deal with SQL tables/views and a good cron job/webhook system to sync data in real time or in batches.

I would rather spend months building a good platform than work on an over-complicated, low-level setup like this.
chatmasta about 4 years ago
It's really cool to see these techniques in the wild, and it also feels encouraging to us, as we're doing something very similar at Splitgraph [0] to implement our "Data Delivery Network" [1]. Recently we've started calling Splitgraph a "Data Mesh" [2]. As long as we have a plugin [3] for a data source, users can connect external data sources to Splitgraph and make them addressable alongside all the other data on the platform, including versioned snapshots of data called data images [4]. So you can `SELECT FROM namespace/repo:tag`, where `tag` can refer to an immutable version of the data, or e.g. `live` to route to a live external data source via FDW. So far we have plugins for Snowflake, CSV in S3 buckets, MongoDB, ElasticSearch, Postgres, and a few others, like Socrata data portals (which we use to index 40k open public datasets).

Our goal with Splitgraph is to provide a single interface to query and discover data. Our product integrates the discovery layer (a data catalog) with the query layer (a Postgres-compatible proxy to data sources, aka a "data mesh" or perhaps "data lake"). This way, we improve both the catalog and the access layer in ways that would be difficult or impossible as separate products. The catalog can index live data without "drift" problems. And since the query layer is a Postgres-compatible proxy, we can apply data governance rules at query time that the user defines in the web catalog (e.g. sharing data, access control, column masking, query whitelisting, rewriting, rate limiting, auditing, firewalling, etc.).

We like to use GitLab's strategy as an analogy. GitLab may not have the best CI, the best source control, or the best Kubernetes deploy orchestration, but by integrating them all in one platform, they have a multiplicative effect on the platform itself. We think the same logic can apply to the data stack. In our vision of the world, a "data mesh" integrated with a "data catalog" can augment or eventually replace various complicated ETL and warehousing workflows.

P.S. We're hiring immediately for all-remote Senior Software Engineer positions, frontend and backend [5].

P.P.S. We also have a private beta program where we can deploy a full Splitgraph stack onto either self-hosted or managed infrastructure. If you want that, get in touch. We'll probably be in beta for 12-18 months.

[0] https://www.splitgraph.com

[1] We talked about all this in depth on a podcast: https://softwareengineeringdaily.com/2020/11/06/splitgraph-d...

[2] https://martinfowler.com/articles/data-monolith-to-mesh.html

[3] https://www.splitgraph.com/blog/foreign-data-wrappers

[4] https://www.splitgraph.com/docs/concepts/images

[5] Job posting: https://www.notion.so/splitgraph/Splitgraph-is-Hiring-25b421...