Asking HN data engineers for opinions and recommendations:<p>I am building on top of blockchain data using Google Cloud and BigQuery, but queries over most of the data are just too slow and take a long time.<p>I believe there has been huge improvement in database tooling, as well as new DBs popping up. Are there any recommendations I can implement to make my SQL queries much faster?<p>Two that I came across are trino.io and Apache Spark. Has anyone used these before and seen noticeable performance changes?
You're probably going to be hard-pressed to beat the sheer data throughput of BigQuery without significant developer investment. I've done a lot of Apache Spark, Flume and Beam, and they all have their own quirks and pain points: none of them run at interactive speeds, they have OOM issues for certain operations, and they require a lot of glue code/data management.<p>Honestly, at this point you're probably better off looking into the query plan and execution stats of your BigQuery jobs to see if you're doing something really weird, and optimizing that: <a href="https://cloud.google.com/bigquery/docs/query-plan-explanation" rel="nofollow">https://cloud.google.com/bigquery/docs/query-plan-explanatio...</a>
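If it helps, here's a minimal sketch of pulling those stats programmatically with the google-cloud-bigquery Python client (the query and table names are placeholders, not your schema): a dry run tells you roughly how many bytes you're about to scan, and the query plan shows the records flowing through each stage.<p><pre><code># Minimal sketch using the google-cloud-bigquery client.
# Table/query below are placeholders -- swap in your own.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT block_number, COUNT(*) AS tx_count
FROM `my_project.my_dataset.transactions`
WHERE block_timestamp >= TIMESTAMP('2023-01-01')
GROUP BY block_number
"""

# Dry run: estimate bytes scanned without executing (or billing).
dry = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)
print("would scan ~%.2f GB" % (dry.total_bytes_processed / 1e9))

# Real run: wait for completion, then walk the execution plan stage by stage.
job = client.query(sql)
job.result()
for stage in job.query_plan:
    print(stage.name, "read:", stage.records_read,
          "wrote:", stage.records_written)
</code></pre><p>The usual red flag is a stage that reads far more records than it writes, which tends to point at a missing filter or a partition/clustering pruning opportunity.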
<a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS" rel="nofollow">https://en.wikipedia.org/wiki/Column-oriented_DBMS</a><p>And past that, make the "big" data as "small" as you can: remove data points that aren't necessary, remove fields that aren't necessary, and represent what remains with as few bits as possible.
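For what it's worth, in BigQuery that usually translates to materializing a slim, partitioned and clustered copy of the raw table with only the columns you actually query, cast to the narrowest types that are safe. Rough sketch below; the table and column names are hypothetical, not any specific dataset:<p><pre><code># Hypothetical example: shrink a wide raw table into a slim, partitioned one.
from google.cloud import bigquery

client = bigquery.Client()

slim_sql = """
CREATE OR REPLACE TABLE `my_project.my_dataset.transactions_slim`
PARTITION BY DATE(block_timestamp)
CLUSTER BY to_address
AS
SELECT
  block_timestamp,
  block_number,                      -- keep only the fields you query
  to_address,
  CAST(gas AS INT64) AS gas,         -- narrower representation where safe
  CAST(value AS BIGNUMERIC) AS value
FROM `my_project.my_dataset.transactions_raw`
"""

client.query(slim_sql).result()
</code></pre><p>Downstream queries then only scan the partitions and columns they touch, which is where most of the cost and latency wins tend to come from in BigQuery.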