TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Show HN: Work with CSV files using SQL. For data scientists and engineers

99 点作者 tanin将近 4 年前

26 条评论

gaha将近 4 年前
The way I love doing these kinds of things is by literally using three lines of R code:<p>library(sqldf)<p>tab1 = read.csv(&quot;file1.csv&quot;)<p>sqldf(&quot;select * from tab1&quot;)<p>sqldf [1] is a library with which you can access R dataframes just like tables in SQL (it is actually using SQLite in the background). I do not have much experience with programming in R itself, but you barely need it. This approach has the advantage that it is very flexible, and you get the power of the (IMHO) best plotting library ggplot with it. Of course, you can also do more complicated stuff like joining multiple dataframes and basically everything else you can do with SQLite, and then store results back in R dataframes. This workflow works if you use an IDE which lets you execute single lines or blocks of codes like R-Studio. Then you also get a nice GUI with it, but there are also plugins for VI&#x2F;Emacs that work very well.<p>[1] <a href="https:&#x2F;&#x2F;cran.r-project.org&#x2F;web&#x2F;packages&#x2F;sqldf&#x2F;index.html" rel="nofollow">https:&#x2F;&#x2F;cran.r-project.org&#x2F;web&#x2F;packages&#x2F;sqldf&#x2F;index.html</a><p>EDIT: code formatting
评论 #27876370 未加载
评论 #27876483 未加载
cube2222将近 4 年前
Great to see SQL being used more for such use cases! (I really like SQL)<p>This looks nice, especially with the charting!<p>It&#x27;s good to know though that there are a few similar open source tools already:<p>- q[0] - SQL on CSV and TSV<p>- TextQL[1] - SQL on CSV and TSV<p>- Datasette[2] - Running SQL on various datasets in various formats, by (if I understand it correctly) importing it to sqlite and running the query there. This one&#x27;s the closest one to the posted project I think, with interactive data exploration and publishing built in.<p>- OctoSQL[3] - SQL on files like CSV, JSON as well as non-file databases, endless streams with temporal SQL extensions and joining everything together in a single query.<p>The performance numbers of this tool however are impressive if they will check out with real data. 1GB in 10 seconds is fast. However, looking at the documentation it looks like most of the heavy lifting is done by SQLite under the hood. From the docs:<p>&gt; Under the hood, Superintendent loads CSVs into a SQLite database.<p>Disclaimer: I&#x27;m the author of OctoSQL. Still actively developing a rewrite on the redesign branch.<p>[0]: <a href="https:&#x2F;&#x2F;github.com&#x2F;harelba&#x2F;q" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;harelba&#x2F;q</a><p>[1]: <a href="https:&#x2F;&#x2F;github.com&#x2F;dinedal&#x2F;textql" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;dinedal&#x2F;textql</a><p>[2]: <a href="https:&#x2F;&#x2F;datasette.io" rel="nofollow">https:&#x2F;&#x2F;datasette.io</a><p>[3]: <a href="https:&#x2F;&#x2F;github.com&#x2F;cube2222&#x2F;octosql" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;cube2222&#x2F;octosql</a>
评论 #27872498 未加载
评论 #27874815 未加载
评论 #27874457 未加载
rognjen将近 4 年前
Seems useful but at the same time it&#x27;s comparison with loading CSV into SQL seems kind of disingenuous since if you&#x27;re familiar with SQL you&#x27;re probably familiar with command line enough to load CSV into SQLite (which is free in both meanings)
评论 #27877628 未加载
jonnycomputer将近 4 年前
So its a GUI wrapper around sqlite. I guess I expect data scientists and engineers to be comfortable with command line and scripting, and there are GUIs that will work with sqlite already, so I guess I&#x27;m having hard time figuring out the target audience for this.
评论 #27878713 未加载
EMM_386将近 4 年前
This open source tool on Github allows you to quickly import a CSV and write SQL queries against it. The CSV is imported into a SQLite database. It also does a lot more, highly recommended.<p><a href="https:&#x2F;&#x2F;github.com&#x2F;sqlitebrowser&#x2F;sqlitebrowser" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;sqlitebrowser&#x2F;sqlitebrowser</a>
评论 #27872772 未加载
draegtun将近 4 年前
Perl DBD::CSV module is a classic in this genre (first version is from 1998) - <a href="https:&#x2F;&#x2F;metacpan.org&#x2F;pod&#x2F;DBD::CSV" rel="nofollow">https:&#x2F;&#x2F;metacpan.org&#x2F;pod&#x2F;DBD::CSV</a>
banana_giraffe将近 4 年前
&gt; A python script loading a 1GB file is likely to take more than 1 minutes on any machine.<p>Since when? I just tested it, and I can process a 1gb CSV file on my machine using the csv module in Python in ~10 seconds. My machine isn&#x27;t a world record setting machine, either.
评论 #27875977 未加载
sgt将近 4 年前
It seems interesting, but what is the advantage over just \copy in Postgres from a CSV file - or better yet, import into SQLite without even providing the table definition?
评论 #27872748 未加载
评论 #27874757 未加载
评论 #27872684 未加载
hprotagonist将近 4 年前
The datasette author offers this tool for conversion: <a href="https:&#x2F;&#x2F;github.com&#x2F;simonw&#x2F;csvs-to-sqlite" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;simonw&#x2F;csvs-to-sqlite</a>
评论 #27873338 未加载
beckingz将近 4 年前
Cool tool!<p>Panda SQL is great for doing SQL on pandas dataframes. <a href="https:&#x2F;&#x2F;pypi.org&#x2F;project&#x2F;pandasql&#x2F;" rel="nofollow">https:&#x2F;&#x2F;pypi.org&#x2F;project&#x2F;pandasql&#x2F;</a>
评论 #27876353 未加载
marto1将近 4 年前
Really cool and smooth! One killer feature here will be auto translating to pandas for people to copy over snippets to their Jupyter notebooks.<p>You don&#x27;t want to know how many times I&#x27;ve been asked to do exactly that.<p>Other than that one can do that (somewhat clumsily) with postgres using file_fdw like:<p>CREATE FOREIGN TABLE mytable ( datetime timestamp(3) with time zone, name text, score integer, ) SERVER myserver OPTIONS ( filename &#x27;data&#x2F;file1.csv&#x27;, format &#x27;csv&#x27; );<p>and then query as usual. More docs here: <a href="https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;13&#x2F;file-fdw.html" rel="nofollow">https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;13&#x2F;file-fdw.html</a>
hodgesrm将近 4 年前
You can query CSV directly without a DBMS backend using clickhouse-local. [1] It&#x27;s often used to clean data exactly as this article describes. You can also load CSV very easily into tables.<p>Superintendent.app seems to bring automatic schema definition to the table. We&#x27;ve discussed adding this to ClickHouse. Would be a great PR if anyone is interested.<p>[1] <a href="https:&#x2F;&#x2F;altinity.com&#x2F;blog&#x2F;2019&#x2F;6&#x2F;11&#x2F;clickhouse-local-the-power-of-clickhouse-sql-in-a-single-command" rel="nofollow">https:&#x2F;&#x2F;altinity.com&#x2F;blog&#x2F;2019&#x2F;6&#x2F;11&#x2F;clickhouse-local-the-pow...</a><p>(This note brought to you by the ClickHouse evangelization task force.)
评论 #27875938 未加载
评论 #27877687 未加载
aphrax将近 4 年前
This area interests me but being completely honest there are a few off-putting grammatical&#x2F;spelling errors. Might be worth getting them cleaned up..
评论 #27874719 未加载
eatonphil将近 4 年前
Very neat! I&#x27;m also working on an (open-source&#x2F;core) app for running SQL against any data (copy-pasted stuff, data files, logs, HTTP responses, other SQL requests) and being able to script and graph as well.<p>A difference might be that the goal I have for DataStation is to aid more in data exploration (but not being limited to smaller datasets like this site points out Google Sheets and Excel are). Some optimizations like streaming all data (and streaming through scripts) are currently out of scope and that will definitely limit the scale DataStation can handle.<p>It&#x27;s primarily to help out the case where you have a bunch of different data and you want to join&#x2F;filter&#x2F;group and graph the result for some business&#x2F;product&#x2F;engineering executive team.<p><a href="https:&#x2F;&#x2F;datastation.multiprocess.io&#x2F;" rel="nofollow">https:&#x2F;&#x2F;datastation.multiprocess.io&#x2F;</a>
jdnier将近 4 年前
If your work is more OLAP focused, DuckDb has good tools for loading csv files. <a href="https:&#x2F;&#x2F;duckdb.org&#x2F;docs&#x2F;data&#x2F;csv" rel="nofollow">https:&#x2F;&#x2F;duckdb.org&#x2F;docs&#x2F;data&#x2F;csv</a><p>DuckDB is like a columnar SQLite. It also has great support for Pandas DataFrames.
kmerroll将近 4 年前
Loading and running SQL on structured text files (.CSV) is definitely an underserved data tools niche, but this statement threw me: &quot;No more custom and slow Python script. No more vlookup, no more pivot tables.&quot; Wow, I&#x27;ll keep my slow Python&#x2F;Pandas script thank you.
cube00将近 4 年前
No license keys required with H2 <a href="http:&#x2F;&#x2F;www.h2database.com&#x2F;html&#x2F;tutorial.html#csv" rel="nofollow">http:&#x2F;&#x2F;www.h2database.com&#x2F;html&#x2F;tutorial.html#csv</a>
robbiejs将近 4 年前
Looks really nice. In what language is the app itself written? If it&#x27;s web app, I&#x27;m curious to know what table component you are using.<p>Also, for anyone looking to edit simple csv files in an excel-like matter, please check <a href="https:&#x2F;&#x2F;editcsvonline.com" rel="nofollow">https:&#x2F;&#x2F;editcsvonline.com</a>, a free tool that I made, powered by DataGridXL (which I also made).
adam_gyroscope将近 4 年前
You can do this with bit.io, although it’s not local; import your csv (drag and drop or browse to upload) and get a Postgres database in the cloud. Full disclosure I’m a founder at bit.io.<p>I love to see anything that makes it easier to use data!
gigatexal将近 4 年前
I do this already with sqlite3
refactor_master将近 4 年前
If I open a CSV with pandas in debug mode I’m also pretty much free to play with whatever transformations I want, without being restricted by SQL syntax. Saved as a parquet, reloads are also pretty much instantaneous.
评论 #27874766 未加载
_1tan将近 4 年前
Is this similar to <a href="https:&#x2F;&#x2F;github.com&#x2F;harelba&#x2F;q" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;harelba&#x2F;q</a> ?
chrisweekly将近 4 年前
This reminds me of <a href="https:&#x2F;&#x2F;lnav.org" rel="nofollow">https:&#x2F;&#x2F;lnav.org</a>
mjirv将近 4 年前
I like that this lets you join CSVs, which a lot of tools in this space don’t.
评论 #27876431 未加载
dantetheinferno将近 4 年前
Why isn&#x27;t this opensourced?
评论 #27880396 未加载
toyg将近 4 年前
I&#x27;m old enough to know I should give a pass to anything &quot;free while in beta&quot;...