If Presto SQL can query every database under the sun (RDBMS, Hadoop HDFS, Kafka, ElasticSearch), why hasn't it become an industry standard yet? What's the catch? What are its limitations?
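For context, the "query everything" premise refers to federated queries: each backend is registered as a catalog and can be joined in a single statement. A minimal sketch (the catalog, schema, and table names here are invented for illustration, not a real setup):

    -- Hypothetical catalogs 'postgresql' and 'hive' configured on the cluster.
    SELECT o.customer_id,
           sum(o.amount)     AS lifetime_spend,
           count(c.event_id) AS click_events
    FROM postgresql.public.orders AS o
    LEFT JOIN hive.weblogs.clicks AS c
      ON c.customer_id = o.customer_id
    GROUP BY o.customer_id
    ORDER BY lifetime_spend DESC
    LIMIT 20;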
I've been using it for years with clients. It tends to sit between source data and a final destination. Most data platforms are trying to take data from tens, if not hundreds, of sources and unify them. The variety of formats and sources is endless, and I've often had to resort to Python-based Airflow DAGs to collect data by date and store a cleaned-up version of what was collected as a timestamped Parquet file for Presto. Presto is then great for large-scale transformations across the data and ad-hoc exploration.

Presto can be pointed at a lot of data stores, but few external data providers offer ODBC-like interfaces. It seems to be either APIs or static file dumps for the most part, so Presto isn't going to be able to pull from those datasets on its own.

In terms of security and maintenance, products like Redshift are easier to train traditional data warehouse people on. The service is relatively cheap and has a nice UI for scaling.

The data world is extremely fragmented. Once firms have something in place, changing it is going to be a struggle. Existing staff often gatekeep and defend whatever technology they've staked their careers on. Once a lot of reports are set up against any one data source, migrating it can become a prolonged project, which is hard to sell.

It was quoted that Snowflake had a $1M/day budget for sales and marketing. I'm not aware of any Presto consultancy spending that sort of money. Amazon does have Athena, but they have countless other offerings, which muddies the water.
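A minimal sketch of the pattern described above, in Presto/Trino SQL against the Hive connector (the `hive` catalog, the S3 path, and the column names are assumptions, not the commenter's actual setup):

    -- Register the Airflow-produced, date-partitioned Parquet dumps as an external table.
    CREATE TABLE hive.raw.provider_feed (
        customer_id bigint,
        amount      double,
        ds          varchar   -- partition column: the timestamp/date written by the DAG
    )
    WITH (
        external_location = 's3://my-bucket/provider_feed/',
        format = 'PARQUET',
        partitioned_by = ARRAY['ds']
    );

    -- Pick up partitions the DAG has written since the table was created.
    CALL hive.system.sync_partition_metadata('raw', 'provider_feed', 'ADD');

    -- Ad-hoc exploration / large-scale transformation over the cleaned data.
    SELECT ds, count(*) AS row_count, sum(amount) AS total
    FROM hive.raw.provider_feed
    WHERE ds >= '2023-01-01'
    GROUP BY ds
    ORDER BY ds;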
Lots of general reasons: inertia, etc. Companies often stick to one or two preferred technologies, and adding Presto isn't seen as a gain (even though it helped Facebook quite a bit). I also suspect Amazon re-packaging it as Athena reduced adoption to some extent.

If you're looking at Presto (now Trino), the main thing to keep in mind is that you inherit the limitations of the underlying data store.

It's best when the underlying store (plus the DB adapter implementation) lets you parallelize work, keep each node busy, and avoid processing data unnecessarily. Hive/S3 columnar-format data works great for this (IIRC this was a major early use case). Other sources like an RDBMS will have natural limitations. Kafka has its own issues, since each query generally means re-scanning a topic, etc.

I see the data bridges as most useful as a way to bring data into the native/optimal format, then do the heavy-lift work in Presto.
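A hedged sketch of that last point in Presto/Trino SQL: pull from the slower RDBMS catalog once, land the data as columnar Parquet, and run the heavy queries against that copy (the `postgresql` and `hive` catalogs and all table names are illustrative assumptions):

    -- One-time (or scheduled) bridge: copy out of the RDBMS into columnar storage.
    CREATE TABLE hive.curated.orders
    WITH (
        format = 'PARQUET',
        partitioned_by = ARRAY['order_date']
    )
    AS
    SELECT customer_id,
           amount,
           status,
           cast(created_at AS date) AS order_date   -- partition column goes last
    FROM postgresql.public.orders;

    -- Heavy-lift work then runs in parallel over Parquet instead of hammering the RDBMS.
    SELECT order_date, status, sum(amount) AS revenue
    FROM hive.curated.orders
    GROUP BY order_date, status;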
This might be outdated info:

* Two different Prestos, prestodb and prestosql, for maximum confusion. (I think one was renamed; prestosql is now Trino.)
* Making the coordinator highly available by default is hard.
* Autoscaling workers is not simple.
* The code is very dependent on its own web framework, which tries to do everything and lacks docs.
* The resource planner for multiple queries is lacking.
* Worker configuration takes a lot of skill (see the sketch after this list).
All of these could be solved, but in most cases you can find other solutions where you get a simpler set of problems.
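To give a sense of that last bullet: a bare-minimum worker `config.properties` looks something like the sketch below, and the memory settings in particular have to be tuned per workload and instance size (hostnames and values here are placeholders, not recommendations):

    # etc/config.properties on a worker node (values are illustrative placeholders)
    coordinator=false
    http-server.http.port=8080
    discovery.uri=http://coordinator.example.internal:8080

    # Memory limits interact with the JVM heap in etc/jvm.config (-Xmx)
    # and with how many concurrent queries you expect per node.
    query.max-memory=50GB
    query.max-memory-per-node=10GB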
It adds complexity in infrastructure and becomes a separate translation layer. I haven't run a Presto deployment myself, but I have worked adjacent to one, and I do think it brings advantages. But a lot of client-side 'whatevers' already connect to a lot of different database/storage technologies, so adding another complex layer to the mix might not be worth it.
Dremio also appears to take a similar approach, but with more advanced caching features and query pushdown. Plus it has Apache Arrow at its heart. I think that would be my choice of solution in this space.