
Joining CSV Data Without SQL: An IP Geolocation Use Case

91 points by jameskerr over 1 year ago

20 comments

Vt71fcAqt7 over 1 year ago

> But before you can get there, as a prerequisite you need to enumerate all the field names and desired data types to load it into a specific structured format. Maybe there's another way?

In sqlite, this is just:

    .mode csv
    .import data.csv table

> When .import is run, its treatment of the first input row depends upon whether the target table already exists. If it does not exist, the table is automatically created and the content of the first input row is used to set the name of all the columns in the table. [0]

[0] https://www.sqlite.org/cli.html#importing_files_as_csv_or_other_formats
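For reference, a minimal sketch of that flow end to end in the sqlite3 shell, assuming two hypothetical files (blocks.csv and locations.csv) sharing a geoname_id column; .import creates both tables from the header rows, with every column typed TEXT:

    sqlite3 geo.db <<'EOF'
    .mode csv
    .import blocks.csv blocks
    .import locations.csv locations
    -- join on the shared key; column names come from the CSV headers
    SELECT l.city_name, COUNT(*) AS n
    FROM blocks b JOIN locations l ON b.geoname_id = l.geoname_id
    GROUP BY l.city_name;
    EOF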
wiml over 1 year ago

If you really want to, there's also the `join` command, which has been part of the standard Unix toolset for longer than many of us have been alive.
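A minimal sketch with coreutils, assuming two hypothetical CSVs keyed on their first column; join requires both inputs sorted on the key, and this naive form does not handle quoted fields containing commas:

    # sort both files on the key column, then join on field 1
    sort -t, -k1,1 blocks.csv    > blocks.sorted.csv
    sort -t, -k1,1 locations.csv > locations.sorted.csv
    join -t, -1 1 -2 1 blocks.sorted.csv locations.sorted.csv > joined.csv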
smartmic over 1 year ago

> a performant join that avoids the hassle of SQL tables.

Hm, a hassle is not always the same hassle for everyone. Personally, I like SQL and I enjoy the power it can leverage on all kinds of data. As others wrote, sqlite for local data processing is a tool I do not want to miss, besides all of the great Unix coreutils and their (mostly GNU) ecosystem.
Pxtl over 1 year ago

Personally my weapon of choice for this kind of thing is just raw PowerShell. It means all your queries will be simple linear searches, and PowerShell is a warty-as-hell language, but the ergonomics aren't bad.

For example, the "all the cities in Iran" query:

    Import-Csv GeoLite2-City-Locations-en.csv |
        Where-Object { $_.country_name -eq "Iran" } |
        Select-Object -ExpandProperty city_name

You could probably find modules to help with the IP-aware stuff like `cidr_match`, but the real place where it would probably fall over is join performance, since you'd essentially be doing O(n*m) nested-loop scans.
jakjak123 over 1 year ago

I have done some similar, simpler data wrangling with xsv (https://github.com/BurntSushi/xsv) and jq. It could process my 800M rows in a couple of minutes (plus the time to read them out from the database =)
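A minimal xsv join sketch, assuming the same hypothetical pair of files keyed on a shared geoname_id column (the selected output columns are assumptions too):

    # inner join on geoname_id in both files, then keep two columns
    xsv join geoname_id blocks.csv geoname_id locations.csv |
        xsv select network,city_name > joined.csv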
nerdponx over 1 year ago

CLI data processing is always fun and cool. But it tends to also be limited in scope and practicality, not to mention performance if you're chaining operations between function calls and it needs to re-parse the data every time.

If you want to avoid SQL, it's really hard to beat a "data frame" data structure for tabular data processing, including things like joins.
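For comparison, a minimal data-frame sketch run from the shell, assuming pandas is installed and the same hypothetical file layout; each file is parsed once and stays in memory between steps:

    python3 - <<'EOF'
    import pandas as pd

    # parse each file once into an in-memory data frame
    blocks = pd.read_csv("blocks.csv")
    locations = pd.read_csv("locations.csv")

    # hash join on the shared key; no re-parsing between operations
    joined = blocks.merge(locations, on="geoname_id", how="inner")
    print(joined.head())
    EOF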
flusteredBias over 1 year ago

I do the same with DuckDB and pretty-print with tidy-viewer.
chaps over 1 year ago

Ehhhhhhhh. It hasn't really made a super convincing argument not to use SQL. A lot of what is described isn't intrinsically a SQL problem, but instead an ETL problem. E.g., the use of complex types can be handled by loading a CSV into a TEXT table and then test-typecasting on a sample set, followed by the full set.

And geo indexes are no joke. Using them has made 8-hour SQL queries take seconds.
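A minimal sketch of that staging pattern in the sqlite3 shell (table and column names are assumptions; --skip needs a reasonably recent sqlite3 CLI):

    sqlite3 geo.db <<'EOF'
    -- stage everything as TEXT first, typecast later
    CREATE TABLE staging (network TEXT, latitude TEXT, longitude TEXT);
    .import --csv --skip 1 blocks.csv staging
    -- test the typecast on a sample before running it over the full set
    SELECT CAST(latitude AS REAL), CAST(longitude AS REAL)
    FROM staging LIMIT 10;
    EOF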
gabinator over 1 year ago

Wait, you guys aren't copy-and-pasting every CSV by hand into an Excel spreadsheet?
dima55 over 1 year ago

An alternative very appropriate for some use cases: `vnl-join` from the vnlog toolkit (https://github.com/dkogan/vnlog). It uses the `join` tool from coreutils (works well, has been around forever) and `vnlog` for nice column labelling.
brennaw1 over 1 year ago

I'm pretty impressed by how Zed seems to handle the CSV overhead we typically see with standard SQL. That 'gradual slope' concept and the one-shot query without a ton of preprocessing? Pretty slick. Seeing the CSV parse transition to Zed lake queries resulting in that kind of speed-up is intriguing. Before jumping on board, though, I'd be curious to see how Zed holds up with even bigger datasets. The CIDR match and join ops are a nice touch, making it feel a tad SQL-like.
zenincognito over 1 year ago

Or, you can also use OpenRefine from Google.

I'm currently mangling a 4 GB file and working with APIs that use existing data columns to provide output. It's a great tool.
holoduke over 1 year ago

I would rather create a quick SQLite file: create some tables with DBeaver and write a CSV processor that loads the SQLite file. Max 1 hour.
kelsey9876543 over 1 year ago

If you need more tools and power, check out csvkit: https://csvkit.readthedocs.io/en/latest/tutorial/3_power_tools.html#csvjoin-merging-related-data
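A minimal csvjoin sketch over the same hypothetical pair of files; with a single -c column name, csvjoin joins on that column in both inputs:

    # inner join on the shared geoname_id column of both files
    csvjoin -c geoname_id blocks.csv locations.csv > joined.csv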
hermitcrab over 1 year ago

You can join 2 CSV files in a GUI ETL tool like Easy Data Transform in 3 clicks:

- drag the 2 CSV files onto the canvas
- click 'Join'
- select the 2 columns to join
thedougd over 1 year ago

I needed to do this yesterday. Thankfully I recalled a Hacker News suggestion to use SQLite. It took no time at all to get both files imported into tables and run successful join queries. I'm glad it was SQL, as I needed a few basic transforms (case folding, trim, etc.) and conditions.
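A minimal sketch of that kind of transform-in-the-join, again in the sqlite3 shell (table and column names are assumptions):

    sqlite3 geo.db <<'EOF'
    -- normalize case and whitespace directly in the join condition
    SELECT b.network, l.city_name
    FROM blocks b
    JOIN locations l
      ON lower(trim(b.country_name)) = lower(trim(l.country_name));
    EOF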
tobilg over 1 year ago

I'd rather just use DuckDB, write a few SQL queries, and be done.

This can be done in the terminal as well.
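A minimal one-shot sketch with the duckdb CLI (file and column names are assumptions; read_csv_auto infers the schema from each file):

    duckdb -c "
      SELECT l.city_name, count(*) AS n
      FROM read_csv_auto('blocks.csv') b
      JOIN read_csv_auto('locations.csv') l USING (geoname_id)
      GROUP BY l.city_name
      ORDER BY n DESC
      LIMIT 10;
    "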
shortrounddev2 over 1 year ago

It would be cool to see a write-up converting these to equivalent PowerShell queries.
nathants over 1 year ago

yes! non-standard data wrangling, even if just for fun, is a great way to gain a better understanding of your workload and hardware.

tl;dr: [de]serialization is your bottleneck; after that it's general data processing. both waste insane numbers of CPU cycles. network and disk, when accessed linearly, are free.

i remember first looking into this when EC2 i3 came out, and only more so since. lambda for burst CPU capacity when you can't wait 30s for EC2 spot is interesting too.

https://nathants.com/posts/performant-batch-processing-with-bsv-s4-and-presto
qtd6214 over 1 year ago

Please don't do this for actual work you share with other people. There's a good reason why pandas exists.