TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

A SQLite extension for reading large files line-by-line

157 pointsby polyrandalmost 3 years ago

9 comments

alexgarcia-xyzalmost 3 years ago
Hey, author here, happy to answer any questions! Also checkout this notebook for a deeper dive into sqlite-lines, along with a slick WASM demonstration and more thoughts on the codebase itself <a href="https:&#x2F;&#x2F;observablehq.com&#x2F;@asg017&#x2F;introducing-sqlite-lines" rel="nofollow">https:&#x2F;&#x2F;observablehq.com&#x2F;@asg017&#x2F;introducing-sqlite-lines</a><p>I really dig SQLite, and I believe SQLite extensions will push it to another level. I rarely reach for Pandas or other &quot;traditional&quot; tools and query languages, and instead opt for plain ol&#x27; SQLite and other extensions. As a shameless plug, I recently started a blog series on SQLite and related tools and extensions if you want to learn more! Next week I&#x27;ll be publishing more SQLite extensions for parsing HTML + making HTTP requests <a href="https:&#x2F;&#x2F;observablehq.com&#x2F;@asg017&#x2F;a-new-sqlite-blog-series" rel="nofollow">https:&#x2F;&#x2F;observablehq.com&#x2F;@asg017&#x2F;a-new-sqlite-blog-series</a><p>A few other SQLite extensions:<p>- xlite, for reading Excel files, in Rust <a href="https:&#x2F;&#x2F;github.com&#x2F;x2bool&#x2F;xlite" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;x2bool&#x2F;xlite</a><p>- sqlean, several small SQLite extensions in C <a href="https:&#x2F;&#x2F;github.com&#x2F;nalgeon&#x2F;sqlean" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;nalgeon&#x2F;sqlean</a><p>- mergestat, several SQLite extensions for developers (mainly Github&#x27;s API) in Go <a href="https:&#x2F;&#x2F;github.com&#x2F;mergestat&#x2F;mergestat" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;mergestat&#x2F;mergestat</a>
评论 #32289671 未加载
评论 #32289677 未加载
simonwalmost 3 years ago
I love this example from the docs:<p><pre><code> select line -&gt; &#x27;$.id&#x27; as id, line -&gt; &#x27;$.name&#x27; as name from lines_read(&quot;my-file.ndjson&quot;); </code></pre> This is using the new -&gt; JSON operator which was added to SQLite a few months ago.<p>The lines_read() thing there is a table-valued function, which means it returns a virtual table that you can query. This is a streaming operation which means it&#x27;s safe to run it against a 100GB+ file without worrying about it sucking all of the data into memory at once.<p>Where this gets really useful is when you combine it with a create table statement:<p><pre><code> create table names as select line -&gt; &#x27;$.id&#x27; as id, line -&gt; &#x27;$.name&#x27; as name from lines_read(&quot;my-file.ndjson&quot;); </code></pre> This should efficiently create a table with that exact subset of the data pulled from the newline-delimited JSON file.
评论 #32291338 未加载
cube2222almost 3 years ago
Hey!<p>OctoSQL[0] author here, this is really impressive! I like this much more than the approach taken by other sqlite-based tools which first load stuff into SQLite and then let you query it.<p>On the other hand, it does have a cons that it doesn&#x27;t automatically infer the schema of the input JSON and you still have to manually parse the raw lines. Maybe it would be possible to surmount this by exposing a json-dedicated file reading function which also does SQL inference (I&#x27;m not knowledgable about SQLite internals)?<p>One piece of feedback is with regard to the benchmarks: I think it would be worth it to add additional benchmarks which work on slightly more complex datasets than the one used here. I did a comparison of this vs OctoSQL on the Brazil dataset, and - as expected - sqlite-lines wiped the floor with it. However, then I ran the following queries on a slightly more complex dataset (the Amazon review dataset in this case, from SPyQL&#x27;s benchmark notebook[1]):<p><pre><code> ~&gt; time OCTOSQL_NO_TELEMETRY=1 .&#x2F;octosql &quot;SELECT COUNT(*), AVG(overall) FROM books.json WHERE reviewerName = &#x27;James&#x27;&quot; +-------+-------------------+ | count | avg_overall | +-------+-------------------+ | 3010 | 4.402325581395349 | +-------+-------------------+ real 0m49.805s user 0m32.169s sys 0m9.163s ~&gt; time .&#x2F;lines0-linux-amd64-sqlite3 :memory: &quot;SELECT COUNT(*), AVG(json_extract(line, &#x27;$.overall&#x27;)) FROM lines_read(&#x27;books.json&#x27;) WHERE json_extract(line, &#x27;$.reviewerName&#x27;) = &#x27;James&#x27;&quot; 3010|4.40232558139535 real 1m47.933s user 1m27.024s sys 0m11.559s </code></pre> and as you can see, the results go in a very different direction.<p>But anyhow, congrats on the project, and I&#x27;m pumped to see what you come up with next!<p>[0]: <a href="https:&#x2F;&#x2F;github.com&#x2F;cube2222&#x2F;octosql" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;cube2222&#x2F;octosql</a><p>[1]: <a href="https:&#x2F;&#x2F;github.com&#x2F;dcmoura&#x2F;spyql&#x2F;blob&#x2F;master&#x2F;notebooks&#x2F;json_benchmark.ipynb" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;dcmoura&#x2F;spyql&#x2F;blob&#x2F;master&#x2F;notebooks&#x2F;json_...</a>
评论 #32289743 未加载
freecodyxalmost 3 years ago
I like the fact that author is extensively documenting the source code, following the sqlite spirit
bicijayalmost 3 years ago
Thats interesting, is there any way to add &quot;custom delimiter templates&quot; (or alternative tools)? I have some files where each line contains different kind of informations, but those are separated by their index on the row, so for example, customer name is the letter index 3 to letter index 10.
评论 #32289058 未加载
评论 #32289412 未加载
qbasic_foreveralmost 3 years ago
Very cool looking! I love using SQL for ad-hoc data analysis, my brain just groks the language so much more easily than other data query languages like python + pandas.
jokoonalmost 3 years ago
I&#x27;ve read that spatialite is not well maintained, is that true?
randyrandalmost 3 years ago
nice simple &amp; reliable approach to a hard perf problem.
didgetmasteralmost 3 years ago
I have built my own database engine on top of a general-purpose data management system I invented. It has an application that will let you create a DB table from a file (Json, JsonL, CSV, etc.). The application uses the Qt windowing framework that also includes a Json parser so I just used it.<p>It seems slower than I would like, so I wonder if your code could be used as a stand alone library for parsing the files? My current implementation is able to parse about 25K Json documents per second when each document has 20 KV pairs. I used the Chicago crime data available for download on their open data portal.<p>How fast will your code parse it? Here is an example of three lines:<p>[ { &quot;ID&quot; : &quot;10224738&quot;, &quot;Case Number&quot; : &quot;HY411648&quot;, &quot;Date&quot; : &quot;2015-09-05T13:30:00&quot;, &quot;Block&quot; : &quot;043XX S WOOD ST&quot;, &quot;IUCR&quot; : &quot;0486&quot;, &quot;Primary Type&quot; : &quot;BATTERY&quot;, &quot;Description&quot; : &quot;DOMESTIC BATTERY SIMPLE&quot;, &quot;Location Description&quot; : &quot;RESIDENCE&quot;, &quot;Arrest&quot; : &quot;false&quot;, &quot;Domestic&quot; : &quot;true&quot;, &quot;Beat&quot; : 924, &quot;District&quot; : 9, &quot;Ward&quot; : 12, &quot;Community Area&quot; : 61, &quot;FBI Code&quot; : &quot;08B&quot;, &quot;X Coordinate&quot; : &quot;1165074&quot;, &quot;Y Coordinate&quot; : &quot;1875917&quot;, &quot;Year&quot; : 2015, &quot;Latitude&quot; : 41.815117, &quot;Longitude&quot; : -87.670000 }, { &quot;ID&quot; : &quot;10224742&quot;, &quot;Case Number&quot; : &quot;HY411435&quot;, &quot;Date&quot; : &quot;2015-09-05T10:55:00&quot;, &quot;Block&quot; : &quot;082XX S LOOMIS BLVD&quot;, &quot;IUCR&quot; : &quot;0610&quot;, &quot;Primary Type&quot; : &quot;BURGLARY&quot;, &quot;Description&quot; : &quot;FORCIBLE ENTRY&quot;, &quot;Location Description&quot; : &quot;RESIDENCE&quot;, &quot;Arrest&quot; : &quot;false&quot;, &quot;Domestic&quot; : &quot;false&quot;, &quot;Beat&quot; : 614, &quot;District&quot; : 6, &quot;Ward&quot; : 21, &quot;Community Area&quot; : 71, &quot;FBI Code&quot; : &quot;05&quot;, &quot;X Coordinate&quot; : &quot;1168430&quot;, &quot;Y Coordinate&quot; : &quot;1850165&quot;, &quot;Year&quot; : 2015, &quot;Latitude&quot; : 41.744379, &quot;Longitude&quot; : -87.658431 }, { &quot;ID&quot; : &quot;10224745&quot;, &quot;Case Number&quot; : &quot;HY411654&quot;, &quot;Date&quot; : &quot;2015-09-05T11:30:00&quot;, &quot;Block&quot; : &quot;031XX W WASHINGTON BLVD&quot;, &quot;IUCR&quot; : &quot;0320&quot;, &quot;Primary Type&quot; : &quot;ROBBERY&quot;, &quot;Description&quot; : &quot;STRONGARM - NO WEAPON&quot;, &quot;Location Description&quot; : &quot;STREET&quot;, &quot;Arrest&quot; : &quot;false&quot;, &quot;Domestic&quot; : &quot;true&quot;, &quot;Beat&quot; : 1222, &quot;District&quot; : 12, &quot;Ward&quot; : 27, &quot;Community Area&quot; : 27, &quot;FBI Code&quot; : &quot;03&quot;, &quot;X Coordinate&quot; : &quot;1155536&quot;, &quot;Y Coordinate&quot; : &quot;1900515&quot;, &quot;Year&quot; : 2015, &quot;Latitude&quot; : 41.882814, &quot;Longitude&quot; : -87.704326 } ]
评论 #32314141 未加载