SPyQL (<a href="https://github.com/dcmoura/spyql" rel="nofollow">https://github.com/dcmoura/spyql</a>) is SQL with Python in the middle, an open-source project fully written in Python for making command-line data processing more intuitive, readable and powerful. Try mixing in the same pot: a SQL SELECT for providing the structure, Python expressions for defining transformations and conditions, the essence of awk as a data-processing language, and the JSON handling capabilities of jq.<p>What does a SPyQL query look like?<p><pre><code> $ spyql "
IMPORT pendulum AS p
SELECT
(p.now() - p.from_timestamp(purchase_ts)).in_days() AS days_ago,
sum_agg(price * quantity) AS total
FROM csv
WHERE department.upper() == 'IT' and purchase_ts is not Null
GROUP BY 1
ORDER BY 1
TO json" < my_purchases.csv
</code></pre>
In a single statement we are 1) reading a CSV (of purchases) with automatic header detection, dialect detection, type inference and casting, 2) filtering out records that do not belong to the IT department or do not have a purchase timestamp, 3) summing the total purchases and grouping them by how many days ago they happened, 4) sorting from the most to the least recent day, and 5) writing the result in JSON format. All this without loading the full dataset into memory.<p>The Readme is loaded with recipes and there is also a demo video: <a href="https://vimeo.com/danielcmoura/spyqldemo" rel="nofollow">https://vimeo.com/danielcmoura/spyqldemo</a><p>Any feedback is welcome! Thank you.
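For anyone curious about the moving parts, here is a rough plain-Python sketch of the same pipeline. This is a simplified, hypothetical rendering: it uses the stdlib's datetime instead of pendulum and assumes the numeric columns parse cleanly, whereas SPyQL does dialect detection and type inference for you.

```python
import csv
import io
import json
from collections import defaultdict
from datetime import datetime, timezone

def summarize(csv_text):
    """Group IT purchases by how many days ago they happened, summing price*quantity."""
    now = datetime.now(timezone.utc)
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # WHERE department.upper() == 'IT' and purchase_ts is not Null
        if row["department"].upper() != "IT" or not row["purchase_ts"]:
            continue
        ts = datetime.fromtimestamp(float(row["purchase_ts"]), tz=timezone.utc)
        days_ago = (now - ts).days
        # GROUP BY 1, with sum_agg(price * quantity) as the aggregate
        totals[days_ago] += float(row["price"]) * float(row["quantity"])
    # ORDER BY 1, TO json
    return json.dumps(
        [{"days_ago": d, "total": t} for d, t in sorted(totals.items())]
    )
```

The SPyQL query expresses the same thing in one declarative statement, streaming over stdin.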
Nice work! I have a similar tool built on SQLite [0] and there are a number of similar tools (compared in my readme) like octosql, q, textql, datasette, etc.<p>Implementation-wise, is there a reason you wrote your own SQL parser rather than using an existing library? I ask not because I'd discourage handwritten SQL parsers but because I think the particular approach you took is going to suffer when it comes to error messages and lexical edge cases like `SELECT(SELECT 1)`.<p>Not to mention that it introduces a completely new SQL dialect users would have to learn. (I think even SQLite is a bit risky since most people know MySQL or PostgreSQL dialects.)<p>But I can appreciate you wanting your own syntax, especially for nested objects. In dsq I solved that by giving nested fields a literal column name with a dot in it. But that means you have to quote the column name (e.g. `SELECT "city.address" FROM {}`).<p>The additional penalty you'd pay for not backing this with SQLite or DuckDB or some other in-memory database is in query performance as the input grows. Among other existing tools like this, octosql has its own query engine, but cube2222 has put a lot of time into optimizing it and it's also written in Go.<p>Whatever the case, competition is good! Happy to chat more if you want. I love database projects!<p>[0] <a href="https://github.com/multiprocessio/dsq" rel="nofollow">https://github.com/multiprocessio/dsq</a>
I took a look to see how this works, expecting a super-complex parser/compiler - and it's mainly some straight-forward regular expression replacements! <a href="https://github.com/dcmoura/spyql/blob/4b5ea3ff52875d2c1a544f06328b010720d624ab/spyql/cli.py#L174" rel="nofollow">https://github.com/dcmoura/spyql/blob/4b5ea3ff52875d2c1a544f...</a><p>I'm very impressed - this is some very neat pragmatic software design.
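As a toy illustration of that style (not SPyQL's actual code), splitting a query into clauses with a single regular expression might look like this. The keyword list and function name are hypothetical:

```python
import re

# Hypothetical sketch of regex-based clause splitting, in the spirit of
# SPyQL's pragmatic parser (not its actual implementation).
KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "TO"]

def split_clauses(query):
    """Return a dict mapping each top-level keyword to the text that follows it."""
    pattern = r"\b(" + "|".join(KEYWORDS) + r")\b"
    # re.split with a capturing group keeps the keywords in the result list
    parts = re.split(pattern, query, flags=re.IGNORECASE)
    clauses = {}
    it = iter(parts[1:])  # parts[0] is anything before the first keyword
    for keyword, body in zip(it, it):
        clauses[keyword.upper()] = body.strip()
    return clauses
```

Of course, this breaks down on keywords inside string literals or subqueries, which is exactly where the error-message and edge-case concerns in the sibling comment come in.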
Nice one! I like the SQL extension to denote the output format ("TO json"), great idea! It's also interesting that you allow calling python methods on values, really neat!<p>If you're interested in tools like this, also make sure to check out OctoSQL[0], which has an extensible plugin architecture and lets you work with both files (JSON, CSV, ...) and databases (PostgreSQL, MySQL, ...) - even use all of them in a single query.
It also supports working with streaming data sources, with Kafka available soon.<p>Other than that, check out all the other notable tools in this area: datafusion, datasette, dsq, q, textql.<p>Disclaimer: I'm the author of OctoSQL.
Interesting hybrid.<p>Today I came across yet another Twitter thread inquiring whether you are a SQL-ista or a Pythonista: <a href="https://twitter.com/bennstancil/status/1486031369327292421" rel="nofollow">https://twitter.com/bennstancil/status/1486031369327292421</a><p>One response probably summed it up best:
“SQL is written for you to think in <i>shapes</i> while python is written for you to think in <i>steps</i>. SQL lets you focus on defining a shape, and the warehouse figures out how to compute it. That's a better flow & focus for data transformation work.”
Interesting concept.<p>It's always intriguing to me that no major database vendors have added support for Python, JavaScript, Lua, et al., for their expression and stored procedure definition languages. It would really make things more accessible than Transact-SQL[1], Oracle's PL/SQL[2] or DB2's messy external procedure support.[3]<p>[1]: <a href="https://docs.microsoft.com/en-us/sql/t-sql/statements/create-procedure-transact-sql?view=sql-server-ver15" rel="nofollow">https://docs.microsoft.com/en-us/sql/t-sql/statements/create...</a><p>[2]: <a href="https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6009.htm" rel="nofollow">https://docs.oracle.com/cd/B19306_01/server.102/b14200/state...</a><p>[3]: <a href="https://www.ibm.com/docs/en/db2/11.5?topic=statements-create-procedure-external" rel="nofollow">https://www.ibm.com/docs/en/db2/11.5?topic=statements-create...</a>
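As an aside, embedded databases let you get partway there already: SQLite's Python driver can register an arbitrary Python function and call it from SQL. A minimal sketch (the function name `reverse_str` is made up for illustration):

```python
import sqlite3

# Register a Python lambda as a SQL function and call it from a query.
conn = sqlite3.connect(":memory:")
conn.create_function("reverse_str", 1, lambda s: s[::-1])
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES ('spyql')")
row = conn.execute("SELECT reverse_str(name) FROM t").fetchone()
print(row[0])  # lqyps
```

It only covers expressions, not stored procedures, but it is the same "host language inside SQL" idea.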
I love this tool!<p>I use it every day to insert monitor data from Lunar (<a href="https://lunar.fyi" rel="nofollow">https://lunar.fyi</a>) into The Monitor Database (<a href="https://db.lunar.fyi" rel="nofollow">https://db.lunar.fyi</a>)<p>I also use it occasionally to insert data from my Meilisearch instances into Postgres/SQLite for more accurate filtering when needed.
This is like the opposite of<p><a href="https://github.com/ponyorm/pony/" rel="nofollow">https://github.com/ponyorm/pony/</a><p><a href="https://macropy3.readthedocs.io/en/latest/pinq.html#pinq" rel="nofollow">https://macropy3.readthedocs.io/en/latest/pinq.html#pinq</a>
Interesting, I assume SPyQL uses eval internally? A couple of years ago I wrote [RBQL](<a href="https://rbql.org" rel="nofollow">https://rbql.org</a>), which is based on the exact same principle and also supports JS expressions; I would be curious to compare design decisions in SPyQL and RBQL.
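For readers unfamiliar with that approach, the eval-based core can be sketched in a few lines. This is a toy illustration, not the actual internals of SPyQL or RBQL:

```python
# Toy sketch of the eval-based approach: compile the user's expression once,
# then evaluate it against each row with the row's columns as local variables.
def run_where(rows, condition):
    """Filter a list of dicts with a Python expression, e.g. "price > 10"."""
    code = compile(condition, "<query>", "eval")
    return [row for row in rows if eval(code, {}, row)]
```

The obvious trade-off is that eval executes arbitrary code, which is acceptable for a local command-line tool where the user already has shell access, but not for anything multi-tenant.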
Sounds awesome. Thanks for posting. It'd be great to be able to import this into a Jupyter notebook or script rather than doing it on the command line. Don't know how useful it is as a command-line tool.
I'm happy to see the stdin -> spyql -> stdout model. Plain text is great :)<p>Columnar storage and processing could give you some performance benefits. SPyQL seems like it could potentially work with column text format, which a couple of students and I developed last year. <a href="https://cran.r-project.org/web/packages/ctf/vignettes/overview.html" rel="nofollow">https://cran.r-project.org/web/packages/ctf/vignettes/overvi...</a>
Having Python expressions within a declarative language is a really good idea because we can combine the low-level logic of computing values with the high-level logic of set processing.<p>A similar approach is implemented in the Prosto data processing toolkit:<p><a href="https://github.com/asavinov/prosto" rel="nofollow">https://github.com/asavinov/prosto</a><p>Although Prosto is viewed as an alternative to Map-Reduce by relying on <i>functions</i>, it also supports Python User-Defined Functions in its Column-SQL:<p><pre><code> prosto.column_sql(
"CALCULATE Sales(quantity, price) -> amount",
lambda x: x["quantity"] * x["price"]
)
</code></pre>
In this Column-SQL statement we define a new calculated column whose values are computed in Python (as the product of two columns). An advantage is that we can process data in multiple tables without joins or group-bys, which is much easier than in the existing set-oriented approaches. Another advantage is that we can combine many statements by defining a workflow in an Excel-like manner.
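For comparison, a rough plain-Python analogue of that Column-SQL statement, applying the same lambda row by row (a loose analogy: it ignores Prosto's dependency tracking between column definitions):

```python
# Toy analogue of: CALCULATE Sales(quantity, price) -> amount
sales = [
    {"quantity": 2, "price": 10.0},
    {"quantity": 3, "price": 5.0},
]
calc = lambda x: x["quantity"] * x["price"]
for row in sales:
    row["amount"] = calc(row)  # derive the calculated column per row
```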
This is cool. The environments at my workplace have some YAML files so big they cause the IDE to chug hard, so I think I might be able to find a real use case for this!