SPyQL (<a href="https://github.com/dcmoura/spyql" rel="nofollow">https://github.com/dcmoura/spyql</a>) is SQL with Python in the middle, an open-source project fully written in Python for making command-line data processing more intuitive, readable and powerful. Try mixing in the same pot: a SQL SELECT for providing the structure, Python expressions for defining transformations and conditions, the essence of awk as a data-processing language, and the JSON handling capabilities of jq.<p>What does a SPyQL query look like?<p><pre><code> $ spyql "
IMPORT pendulum AS p
SELECT
(p.now() - p.from_timestamp(purchase_ts)).in_days() AS days_ago,
sum_agg(price * quantity) AS total
FROM csv
WHERE department.upper() == 'IT' and purchase_ts is not Null
GROUP BY 1
ORDER BY 1
TO json" < my_purchases.csv
</code></pre>
In a single statement we are 1) reading a CSV (of purchases) with automatic header detection, dialect detection, type inference and casting, 2) filtering out records that do not belong to the IT department or do not have a purchase timestamp, 3) summing the total purchases and grouping them by how many days ago they happened, 4) sorting from the most to the least recent day, and 5) writing the result in JSON format. All this without loading the full dataset into memory.<p>The Readme is loaded with recipes and there is also a demo video: <a href="https://vimeo.com/danielcmoura/spyqldemo" rel="nofollow">https://vimeo.com/danielcmoura/spyqldemo</a><p>Any feedback is welcome! Thank you.
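For anyone curious about the moving parts, here is a rough plain-Python sketch of the same pipeline. This is a simplified, hypothetical rendering: it uses the stdlib's datetime instead of pendulum and assumes the numeric columns parse cleanly, whereas SPyQL does dialect detection and type inference for you.

```python
import csv
import io
import json
from collections import defaultdict
from datetime import datetime, timezone

def summarize(csv_text):
    """Group IT purchases by how many days ago they happened, summing price*quantity."""
    now = datetime.now(timezone.utc)
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # WHERE department.upper() == 'IT' and purchase_ts is not Null
        if row["department"].upper() != "IT" or not row["purchase_ts"]:
            continue
        ts = datetime.fromtimestamp(float(row["purchase_ts"]), tz=timezone.utc)
        days_ago = (now - ts).days
        # GROUP BY 1, with sum_agg(price * quantity) as the aggregate
        totals[days_ago] += float(row["price"]) * float(row["quantity"])
    # ORDER BY 1, TO json
    return json.dumps(
        [{"days_ago": d, "total": t} for d, t in sorted(totals.items())]
    )
```

The SPyQL query expresses the same thing in one declarative statement, streaming over stdin.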
Nice work! I have a similar tool built on SQLite [0] and there are a number of similar tools (compared in my readme) like octosql, q, textql, datasette, etc.<p>Implementation-wise, is there a reason you wrote your own SQL parser rather than using an existing library? I ask not because I'd discourage handwritten SQL parsers but because I think the particular approach you took is going to suffer when it comes to error messages and lexical edge cases like `SELECT(SELECT 1)`.<p>Not to mention that it introduces a completely new SQL dialect users would have to learn. (I think even SQLite is a bit risky since most people know MySQL or PostgreSQL dialects.)<p>But I can appreciate you wanting your own syntax, especially for nested objects. In dsq I solved that by giving nested fields a literal column name with a dot in it. But that means you have to quote the column name (e.g. `SELECT "city.address" FROM {}`).<p>The additional penalty you'd pay for not backing this with SQLite or DuckDB or some other in-memory database is in query performance as the input grows. Among other existing tools like this, octosql has its own query engine, but cube2222 has put a lot of time into optimizing it and it's also written in Go.<p>Whatever the case, competition is good! Happy to chat more if you want. I love database projects!<p>[0] <a href="https://github.com/multiprocessio/dsq" rel="nofollow">https://github.com/multiprocessio/dsq</a>
I took a look to see how this works, expecting a super-complex parser/compiler - and it's mainly some straight-forward regular expression replacements! <a href="https://github.com/dcmoura/spyql/blob/4b5ea3ff52875d2c1a544f06328b010720d624ab/spyql/cli.py#L174" rel="nofollow">https://github.com/dcmoura/spyql/blob/4b5ea3ff52875d2c1a544f...</a><p>I'm very impressed - this is some very neat pragmatic software design.
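As a toy illustration of that style (not SPyQL's actual code), splitting a query into clauses with a single regular expression might look like this. The keyword list and function name are hypothetical:

```python
import re

# Hypothetical sketch of regex-based clause splitting, in the spirit of
# SPyQL's pragmatic parser (not its actual implementation).
KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "TO"]

def split_clauses(query):
    """Return a dict mapping each top-level keyword to the text that follows it."""
    pattern = r"\b(" + "|".join(KEYWORDS) + r")\b"
    # re.split with a capturing group keeps the keywords in the result list
    parts = re.split(pattern, query, flags=re.IGNORECASE)
    clauses = {}
    it = iter(parts[1:])  # parts[0] is anything before the first keyword
    for keyword, body in zip(it, it):
        clauses[keyword.upper()] = body.strip()
    return clauses
```

Of course, this breaks down on keywords inside string literals or subqueries, which is exactly where the error-message and edge-case concerns in the sibling comment come in.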
Nice one! I like the SQL extension to denote the output format ("TO json"), great idea! It's also interesting that you allow calling python methods on values, really neat!<p>If you're interested in tools like this, also make sure to check out OctoSQL[0], which has an extensible plugin architecture and lets you work with both files (JSON, CSV, ...) and databases (PostgreSQL, MySQL, ...) - even use all of them in a single query.
It also supports working with streaming data sources, with Kafka available soon.<p>Other than that, check out all the other notable tools in this area: datafusion, datasette, dsq, q, textql.<p>Disclaimer: I'm the author of OctoSQL.
Interesting hybrid.<p>Today I came across yet another Twitter thread inquiring whether you are a SQL-ista or a Pythonista: <a href="https://twitter.com/bennstancil/status/1486031369327292421" rel="nofollow">https://twitter.com/bennstancil/status/1486031369327292421</a><p>One response probably summed it up best:
“SQL is written for you to think in <i>shapes</i> while python is written for you to think in <i>steps</i>. SQL lets you focus on defining a shape, and the warehouse figures out how to compute it. That's a better flow & focus for data transformation work.”
Interesting concept.<p>It's always intriguing to me that no major database vendors have added support for Python, JavaScript, Lua, et al., for their expression and stored procedure definition languages. It would really make things more accessible than Transact-SQL[1], Oracle's PL/SQL[2] or DB2's messy external procedure support.[3]<p>[1]: <a href="https://docs.microsoft.com/en-us/sql/t-sql/statements/create-procedure-transact-sql?view=sql-server-ver15" rel="nofollow">https://docs.microsoft.com/en-us/sql/t-sql/statements/create...</a><p>[2]: <a href="https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6009.htm" rel="nofollow">https://docs.oracle.com/cd/B19306_01/server.102/b14200/state...</a><p>[3]: <a href="https://www.ibm.com/docs/en/db2/11.5?topic=statements-create-procedure-external" rel="nofollow">https://www.ibm.com/docs/en/db2/11.5?topic=statements-create...</a>
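As an aside, embedded databases let you get partway there already: SQLite's Python driver can register an arbitrary Python function and call it from SQL. A minimal sketch (the function name `reverse_str` is made up for illustration):

```python
import sqlite3

# Register a Python lambda as a SQL function and call it from a query.
conn = sqlite3.connect(":memory:")
conn.create_function("reverse_str", 1, lambda s: s[::-1])
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES ('spyql')")
row = conn.execute("SELECT reverse_str(name) FROM t").fetchone()
print(row[0])  # lqyps
```

It only covers expressions, not stored procedures, but it is the same "host language inside SQL" idea.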
I love this tool!<p>I use it every day to insert monitor data from Lunar (<a href="https://lunar.fyi" rel="nofollow">https://lunar.fyi</a>) into The Monitor Database (<a href="https://db.lunar.fyi" rel="nofollow">https://db.lunar.fyi</a>)<p>I also use it occasionally to insert data from my Meilisearch instances into Postgres/SQLite for more accurate filtering when needed.
This is like the opposite of<p><a href="https://github.com/ponyorm/pony/" rel="nofollow">https://github.com/ponyorm/pony/</a><p><a href="https://macropy3.readthedocs.io/en/latest/pinq.html#pinq" rel="nofollow">https://macropy3.readthedocs.io/en/latest/pinq.html#pinq</a>
Interesting, I assume SPyQL uses eval internally? A couple of years ago I wrote [RBQL](<a href="https://rbql.org" rel="nofollow">https://rbql.org</a>), which is based on the exact same principle and also supports JS expressions; I would be curious to compare design decisions in SPyQL and RBQL.
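For readers unfamiliar with that approach, the eval-based core can be sketched in a few lines. This is a toy illustration, not the actual internals of SPyQL or RBQL:

```python
# Toy sketch of the eval-based approach: compile the user's expression once,
# then evaluate it against each row with the row's columns as local variables.
def run_where(rows, condition):
    """Filter a list of dicts with a Python expression, e.g. "price > 10"."""
    code = compile(condition, "<query>", "eval")
    return [row for row in rows if eval(code, {}, row)]
```

The obvious trade-off is that eval executes arbitrary code, which is acceptable for a local command-line tool where the user already has shell access, but not for anything multi-tenant.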
Sounds awesome. Thanks for posting. It'd be great to be able to import this into a Jupyter notebook or script rather than doing it on the command line. Don't know how useful it is as a command-line tool.
I'm happy to see the stdin -> spyql -> stdout model. Plain text is great :)<p>Columnar storage and processing could give you some performance benefits. SPyQL seems like it could potentially work with column text format, which a couple of students and I developed last year. <a href="https://cran.r-project.org/web/packages/ctf/vignettes/overview.html" rel="nofollow">https://cran.r-project.org/web/packages/ctf/vignettes/overvi...</a>
Having Python expressions within a declarative language is a really good idea because we can combine the low-level logic of computing values with the high-level logic of set processing.<p>A similar approach is implemented in the Prosto data processing toolkit:<p><a href="https://github.com/asavinov/prosto" rel="nofollow">https://github.com/asavinov/prosto</a><p>Although Prosto is viewed as an alternative to Map-Reduce by relying on <i>functions</i>, it also supports Python User-Defined Functions in its Column-SQL:<p><pre><code> prosto.column_sql(
"CALCULATE Sales(quantity, price) -> amount",
lambda x: x["quantity"] * x["price"]
)
</code></pre>
In this Column-SQL statement we define a new calculated column whose values are computed in Python (as the product of two columns). An advantage is that we can process data in multiple tables without joins or group-bys, which is much easier than in the existing set-oriented approaches. Another advantage is that we can combine many statements by defining a workflow in an Excel-like manner.
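For comparison, a rough plain-Python analogue of that Column-SQL statement, applying the same lambda row by row (a loose analogy: it ignores Prosto's dependency tracking between column definitions):

```python
# Toy analogue of: CALCULATE Sales(quantity, price) -> amount
sales = [
    {"quantity": 2, "price": 10.0},
    {"quantity": 3, "price": 5.0},
]
calc = lambda x: x["quantity"] * x["price"]
for row in sales:
    row["amount"] = calc(row)  # derive the calculated column per row
```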
This is cool. The environments at my workplace have some YAML files so big they cause the IDE to chug hard, so I think I might be able to find a real use case for this!