parser.py [1] has only 1.6k lines, and it is a hand-written parser. That size is amazing if it's really that capable, but intuitively I doubt it. For comparison, DuckDB's select.y [2] has 3,700 lines, and that is only for SELECT. ZetaSQL's grammar file [3] is almost 10k lines.<p>SQL is a monstrous language. Is there some trick that keeps the code simple?<p>[1] <a href="https://github.com/tobymao/sqlglot/blob/main/sqlglot/parser.py" rel="nofollow">https://github.com/tobymao/sqlglot/blob/main/sqlglot/parser....</a>
[2] <a href="https://github.com/duckdb/duckdb/blob/master/third_party/libpg_query/grammar/statements/select.y" rel="nofollow">https://github.com/duckdb/duckdb/blob/master/third_party/lib...</a>
[3] <a href="https://github.com/google/zetasql/blob/master/zetasql/parser/bison_parser.y" rel="nofollow">https://github.com/google/zetasql/blob/master/zetasql/parser...</a>
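For readers wondering what a compact hand-written parser can look like, here is a minimal sketch of one common technique: recursive descent with a precedence table (precedence climbing). It is not taken from SQLGlot's parser.py and may not reflect how that file works. The `PRECEDENCE` table, the `Parser` class, and the tuple-based tree are all hypothetical names chosen for illustration, and the grammar covers only a tiny expression subset, not SQL.

```python
import re

# Hypothetical operator table: binding power per binary operator.
# Supporting a new operator is one extra entry rather than a new grammar rule.
PRECEDENCE = {"OR": 1, "AND": 2, "=": 3, "<": 3, ">": 3,
              "+": 4, "-": 4, "*": 5, "/": 5}

TOKEN_RE = re.compile(r"\s*(\w+|[=<>+\-*/()])")


def tokenize(text):
    """Split an expression into word, number, and single-character operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse_primary(self):
        """Parse a number, an identifier, or a parenthesized expression."""
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            node = self.parse_expression()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token == ")" or token.upper() in PRECEDENCE:
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token) if token.isdecimal() else token

    def parse_expression(self, min_prec=1):
        """Precedence climbing: consume operators that bind at least as tightly as min_prec."""
        left = self.parse_primary()
        while True:
            op = self.peek()
            prec = PRECEDENCE.get(op.upper()) if op is not None else None
            if prec is None or prec < min_prec:
                return left
            self.advance()
            # prec + 1 makes every operator left-associative.
            right = self.parse_expression(prec + 1)
            left = (op.upper(), left, right)


def parse(text):
    """Parse a whole expression string into nested (operator, left, right) tuples."""
    parser = Parser(tokenize(text))
    tree = parser.parse_expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

With this shape, a single loop handles every binary operator, which is one reason a hand-written parser can stay much shorter than a generated grammar that spells out each precedence level as its own rule. Whether that accounts for the size difference discussed above is not something this sketch can establish.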
Author here, feel free to ask me any questions!<p>Something that I'm working on is a pure Python SQL engine <a href="https://github.com/tobymao/sqlglot/blob/main/sqlglot/executor/python.py" rel="nofollow">https://github.com/tobymao/sqlglot/blob/main/sqlglot/executo...</a>. It does the whole shebang: parsing, optimization, logical planning, and physical execution.
SQLGlot is great. We've used it to extend our FOSS probabilistic data linking library [1] so that it can now execute against a variety of SQL backends (Spark, Presto, DuckDB, SQLite), significantly widening our potential user base.<p>We implement the core statistical model in SQL, and then use SQLGlot to transpile to the target execution engine. One big motivation was to future-proof our work: we're no longer tied to Spark, so when the 'next big thing' (GPU-accelerated SQL for analytics?) comes along, it should be relatively straightforward to support it by writing another adaptor.<p>Working on this has highlighted some of the really tricky problems associated with translating between SQL engines, yet we haven't hit any major problems with SQLGlot itself, so kudos to the author!<p>[1] <a href="https://github.com/moj-analytical-services/splink/tree/splink3" rel="nofollow">https://github.com/moj-analytical-services/splink/tree/splin...</a>
Neat! I did an exploration of SQL parsers in different languages [0] and couldn't find much for Python. But between this project itself and the couple it lists in its benchmarks, I have a few more to look at.<p>[0] <a href="https://datastation.multiprocess.io/blog/2022-04-11-sql-parsers.html" rel="nofollow">https://datastation.multiprocess.io/blog/2022-04-11-sql-pars...</a>