I think the most interesting part of this project is the fault tolerance. I can’t say I’ve seen any other projects do this, but it seems reasonable to want checkpointing during a long computation.

Another thing I like is that conceptually it seems like it would be simple to switch the underlying query engine (right now it’s Polars) in the future. Seems like a pretty general distributed system.
I don't have very much background in ML or distributed systems, so forgive my naive questions...

> After all, most ML in industry today seems to be lightweight models applied to heavily engineered features

I assume "lightweight models" are those that don't have too many parameters, and "heavily engineered features" means that the data fed into the model has undergone significant pre-processing via potentially complicated UDFs -- hence the motivation for the project. Is that right?

> Quokka is an open-source push-based vectorized query engine ... it is meant to be much more performant than blocking-shuffle based alternatives like SparkSQL

Does anyone have pointers to what push-based vs. blocking-shuffle engines are? Any good papers?

> It should work on local machine no problem (and should be a lot faster than Pandas!)

So I understand why Quokka is faster than Spark, but I'm a bit uncertain as to why the author is also making a comparison with Pandas on a single machine. Is it because the streaming pipeline design means that Quokka can better take advantage of multiple cores?
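For what it's worth, here's my mental model of the push-based vs. blocking-shuffle distinction as a toy single-process Python sketch. This is not how either engine is implemented -- in a real engine the stages run on different machines and the blocking step is a shuffle written to disk or object storage -- it just illustrates the difference in control flow:

    import itertools

    def scan():
        # Pretend source: yields batches of rows as they are read.
        for i in range(3):
            yield list(range(i * 4, i * 4 + 4))

    def blocking_pipeline():
        # Blocking-shuffle style: each stage materializes its entire
        # output before the next stage is allowed to start.
        batches = list(scan())                                      # stage 1 runs to completion
        filtered = [[x for x in b if x % 2 == 0] for b in batches]  # then stage 2
        return sum(itertools.chain.from_iterable(filtered))         # then stage 3

    def push_pipeline():
        # Push-based style: each batch is pushed through every operator
        # as soon as it is produced, so the stages overlap.
        total = 0
        for batch in scan():
            total += sum(x for x in batch if x % 2 == 0)
        return total

    assert blocking_pipeline() == push_pipeline()

The blocking version pays for materializing every intermediate in full; the push version keeps only one batch in flight at a time, which is also why it can keep multiple cores busy.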
Sorry if I missed it -- are there plans to offer a way to query this in actual SQL? I believe SingleStore is MySQL compatible, for example, which I think is a nice feature. Basically I want to be able to interact with this much like I'd interact with another database I'm using, or perhaps with a SQLAlchemy Core integration (which both SingleStore and Snowflake have).
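To make the ask concrete, here's a minimal sketch of the kind of interaction I mean. The MySQL-style URL works against SingleStore today because SingleStore speaks the MySQL wire protocol; nothing here is Quokka's actual API:

    from sqlalchemy import create_engine, text

    # SingleStore accepts stock MySQL drivers (pip install pymysql),
    # so a plain SQLAlchemy engine works against it.
    engine = create_engine("mysql+pymysql://user:password@host:3306/mydb")

    with engine.connect() as conn:
        for row in conn.execute(text("SELECT id, total FROM orders LIMIT 10")):
            print(row)

Being able to point existing tooling at the engine like this is most of the value of wire compatibility.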
Can you explain how this might differ from something like https://github.com/apache/arrow-ballista

I've seen several variants of "next-gen" Spark, but nowhere have I really seen the different tradeoffs/advantages/disadvantages between them laid out.
Thanks for sharing.

I have a SQL engine in Python too (https://github.com/mabel-dev/opteryx). I focused my initial effort on supporting SQL statements and making the usage feel like a database - that probably reflects the problem I had in front of me when I set out: handling handfuls of gigabytes in a batch environment for ETLs, with a group of new-to-data-engineering engineers. I've recently started looking more at real-time performance, such as distributing work. I'm interested in how you've approached it.
Trino can be fault tolerant, but you have to explicitly enable fault-tolerant execution.

It might be worth running your benchmarks against Trino with fault-tolerant execution mode enabled. Check the documentation here: https://trino.io/docs/current/admin/fault-tolerant-execution.html

Adding fault-tolerant execution to Trino was a big and complicated project; for anyone interested in more details, check here: https://trino.io/blog/2022/05/05/tardigrade-launch.html
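If memory serves, enabling it looks roughly like the following -- double-check the docs above for your Trino version, and note that TASK-level retries also need an exchange manager configured for spooling intermediate data:

    # etc/config.properties
    retry-policy=TASK

    # etc/exchange-manager.properties (required for TASK retries)
    exchange-manager.name=filesystem
    exchange.base-directories=s3://my-bucket/trino-exchange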
One significant disadvantage of PySpark is its reliance on py4j to serialize and deserialize objects between Java and Python when using Python UDFs. That constant exchange overhead becomes burdensome as data volume increases. However, I am glad to see efforts to create a data pipeline framework using Python and Ray.

~One suggestion: a Scala/Java Spark run of those benchmarks would be a valid baseline to compare against as well, instead of PySpark.~
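For readers who haven't hit this: the sketch below (assuming Spark 3.x with PyArrow installed) shows the row-at-a-time Python UDF, where each value crosses the JVM/Python boundary individually, next to the Arrow-backed pandas UDF that moves data in columnar batches and largely amortizes that cost:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    @udf(returnType=LongType())
    def plus_one(x):
        # Row-at-a-time: every value is pickled across the py4j boundary.
        return x + 1

    @pandas_udf(LongType())
    def plus_one_vec(s: pd.Series) -> pd.Series:
        # Arrow-backed: whole columnar batches cross at once.
        return s + 1

    # Trigger an action so the UDFs actually run:
    df.select(plus_one("id").alias("v")).agg({"v": "sum"}).show()      # slow path
    df.select(plus_one_vec("id").alias("v")).agg({"v": "sum"}).show()  # fast path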
Ah, it's SparkSQL, so the execution probably wouldn't involve much py4j, except for the collect.
> A library to parse and optimize SQL,

That's like saying "a library to parse and optimize computer programs", except probably even harder, since a compiler and runtime library can't make many assumptions about the programs they need to run, so they're limited in how much of that context information they can exploit.

Countless person-years have been spent on this, and it's still a very active field of research and engineering.

> 2x SparkSQL performance

Ah, ok, so it can be slow. Never mind then, carry on :-P
The core of it is Rust:

> Very fast kernels for SQL primitives like joins, filtering and aggregations. Quokka uses Polars to implement these. (I sponsor Polars on Github and you should too.) I am also exploring DuckDB, but I have found Polars to be faster so far.
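For anyone who hasn't used Polars, here's a minimal sketch of the kinds of kernels being delegated to it (API as of recent Polars versions; older releases spelled group_by as groupby):

    import polars as pl

    orders = pl.DataFrame({"user_id": [1, 1, 2, 3], "amount": [10.0, 5.0, 8.0, 2.0]})
    users = pl.DataFrame({"user_id": [1, 2, 3], "country": ["US", "DE", "US"]})

    result = (
        orders
        .join(users, on="user_id")                   # join kernel
        .filter(pl.col("amount") > 4.0)              # filter kernel
        .group_by("country")                         # aggregation kernel
        .agg(pl.col("amount").sum().alias("total"))
    )
    print(result)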
I wrote a toy Python database with distributed SQL, Cypher graph, DynamoDB-style, and document storage interfaces, but it's more for experimentation than serious use. It's not ready for use; it's more a demonstration of how little code you can use to write a database.

https://GitHub.com/samsquire/hash-db
I haven't looked into this in detail, and it seems like a fine project at a glance, but this caught my attention from the introduction:

> When I set out, I had several objectives:

> Easy to install and run, especially for distributed deployments.

> [...]

> The first two objectives strongly scream Python as the language of choice for Quokka.

Python is probably one of the last languages I'd consider if ease of deployment is a priority. Packaging has historically been a mess, and deploying standalone binaries across platforms is a pain. State-of-the-art solutions are third-party and involve bundling the interpreter for each platform. It's been a few years since I last used it for anything serious, but I believe this is still the case.

Whereas something like Go actually makes this infinitely easier, for both the developer and the user. One native Go command builds a standalone binary for each platform. It couldn't be simpler.

The other objective of supporting Python UDFs necessarily ties you to Python. And since this is solving a data science problem, it makes sense for it to be written in Python.
> Having lost all the money I made from my startup on shitcoins and the stock market, I returned to my PhD program to build a better distributed query engine, Quokka

Casual README slip of the year...