Command-line data analytics

103 points by dmoura over 2 years ago

11 comments

avogar over 2 years ago
SPyQL looks very promising, great work!

I can't help but mention the clickhouse-local tool: https://clickhouse.com/docs/en/operations/utilities/clickhouse-local

clickhouse-local is a single binary that lets you perform fast data processing using SQL: effectively database features without a database. The tool supports the full breadth of ClickHouse functions and many popular file formats, and it recently gained automatic schema inference. You can query not only local files but also remote files (from S3, HDFS, or static files accessed by URL). Moreover, clickhouse-local has an interactive mode where you can create tables, play with data, and do almost everything that you can do with an ordinary database. And let's not forget, this tool is written in C++, so it's incredibly fast.

Disclaimer: I work at ClickHouse.
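
To make "database features without a database" concrete, here is a minimal sketch of the general idea: running SQL over file contents with no server involved. It uses Python's built-in `sqlite3` module as a stand-in and is not how clickhouse-local works internally. The `query_csv` helper, the `data` table name, and the sample data are all hypothetical, made up for illustration.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table, then run one SQL query.

    Hypothetical helper for illustration only; columns are created untyped,
    so numeric columns should be CAST explicitly in the query.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")  # no server, no file on disk
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', records
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Made-up sample data standing in for a local CSV file.
SAMPLE = "city,temp\nLisbon,21\nOslo,9\nLisbon,25\n"

result = query_csv(
    SAMPLE,
    "SELECT city, COUNT(*), AVG(CAST(temp AS REAL)) "
    "FROM data GROUP BY city ORDER BY city",
)
print(result)
```

Dedicated tools add what this sketch lacks, such as schema inference, many input formats, and remote sources.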
cube2222 over 2 years ago
SPyQL is really cool and its design is very smart, since it can leverage normal Python functions!

As far as similar tools go, if you're interested, I recommend taking a look at DataFusion[0], dsq[1], and OctoSQL[2].

DataFusion is a very (very very) fast command-line SQL engine, but with limited support for data formats.

dsq is based on SQLite, which means it has to load data into SQLite first but then gives you the whole breadth of SQLite. It also supports many data formats, but is slower.

OctoSQL is faster, extensible through plugins, and supports incremental query execution, so you can, e.g., calculate and display a running group by + count while tailing a log file. It also supports normal databases, not just file formats, so you can, e.g., join with a Postgres table.

[0]: https://github.com/apache/arrow-datafusion

[1]: https://github.com/multiprocessio/dsq

[2]: https://github.com/cube2222/octosql

Disclaimer: Author of OctoSQL
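
As a rough illustration of what a "running group by + count" over a stream means, here is a minimal Python sketch. It is not OctoSQL's implementation or query syntax. The `running_counts` function, the `every` parameter, and the assumption that the first whitespace-separated field of each line is the grouping key (e.g. a log level) are all hypothetical.

```python
from collections import Counter


def running_counts(lines, every=2):
    """Yield a snapshot of per-key counts after every `every` non-blank lines.

    Hypothetical sketch: the grouping key is assumed to be the first
    whitespace-separated field of each line.
    """
    counts = Counter()
    seen = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[0]] += 1
        seen += 1
        if seen % every == 0:
            # Emit the current state without waiting for the input to end.
            yield dict(counts)


# Made-up log lines standing in for a tailed log file.
log = ["INFO start", "WARN disk", "INFO ok", "INFO done"]
snapshots = list(running_counts(log))
for snapshot in snapshots:
    print(snapshot)
```

Because the function is a generator over any iterable of lines, the same code could in principle consume `sys.stdin` fed by `tail -f`, printing updated counts as new lines arrive.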
gullywhumper over 2 years ago
See also Jeroen Janssens' Data Science at the Command Line:

https://datascienceatthecommandline.com/2e/
samuell over 2 years ago
SPyQL looks fantastic!

What worried me when looking into SQL tools for CSV files on the command line is the plethora of tools available, and how hard it is to find one that feels solid and well supported enough to become a "default" tool for many daily tasks.

I want to avoid investing a lot of time learning the ins and outs of a tool that might stop being developed a year from now. I wish for something that can become the "awk of tomorrow", but based on SQL or something similar.

Does anyone have any experience with that? Is my worry warranted? Are some projects better supported than others?
pwallqvist over 2 years ago
Once your data reaches a certain size, it might be worth considering tools that do the job quickly enough while still being simple to use. This comparison is very interesting:

https://colab.research.google.com/github/dcmoura/spyql/blob/master/notebooks/json_benchmark.ipynb

Disclaimer: I work at ClickHouse, whose tool is part of the benchmarking efforts linked above.
beckingz over 2 years ago
The best part is that doing analytics via the command line often means that you're doing analytics locally, which often gets you performance superior to a small computing cluster.
photochemsyn over 2 years ago
Very useful, seems to be an effective bridging tool between relational and NoSQL database types, and from the command line! Nice clear documentation page as well.
ptsneves over 2 years ago
Was I the only one thinking of something like Google Analytics, but for the command line? A usability telemetry system for command-line utilities might be useful.
thriftwy over 2 years ago
Naturally, awk/sort/grep are often much more powerful than fiddling with fully qualified SQL.
mufty over 2 years ago
Looks really interesting, certainly something I will enjoy playing with. Great work!
zX41ZdbW over 2 years ago
Does SPyQL have any advantages over clickhouse-local?