
Ask HN: How to Deal with CSVs?

3 points by nullbytesmatter about 3 years ago
I have a massive CSV (~1.6TB) and I need to merge rows that reference the same data. What are your go-to tools for such tasks/data?

10 comments

mattewong about 3 years ago
Assuming you want to do this on a single machine with finite memory, a feasible approach would be:

1. Split into multiple shards using e.g. `split`, where each piece is small enough to fit comfortably into memory.
2. Sort each shard. You could use something like `zsv` (sql subcommand) (https://github.com/liquidaty/zsv) or `xsv` (sort subcommand) (https://github.com/BurntSushi/xsv) for this.
3. Use mergesort to combine all shards, at the same time, into a single output table, and in the process de-dupe and/or merge. (If you have too many shards for your operating system to allow that many files to be open at once, you might need to do an intermediate merge to consolidate shards.)

For #3, I don't know of a tool that does this -- probably one exists for simple de-duping, but it may be harder if you need support for merge logic. If a tool for this does not exist, I would imagine `zsv` could quite easily be extended to handle it.
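A minimal sketch of step 3 in Python, assuming each shard is already sorted by a key in column 0 and that "merge" just means keeping the first row seen per key; the shard filenames and the merge rule are placeholders, not part of the original suggestion:

    # k-way merge of pre-sorted CSV shards with de-duplication, streaming
    # the shards so nothing has to fit fully in memory.
    import csv
    import glob
    import heapq

    def merge_rows(kept, new):
        # Placeholder merge rule: keep the first row encountered for a key.
        return kept

    def merge_shards(shard_glob, out_path):
        files = [open(p, newline="") for p in sorted(glob.glob(shard_glob))]
        readers = [csv.reader(f) for f in files]
        # heapq.merge streams the sorted shards in key order.
        merged = heapq.merge(*readers, key=lambda row: row[0])
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            current_key, current_row = None, None
            for row in merged:
                if row[0] == current_key:
                    current_row = merge_rows(current_row, row)
                else:
                    if current_row is not None:
                        writer.writerow(current_row)
                    current_key, current_row = row[0], row
            if current_row is not None:
                writer.writerow(current_row)
        for f in files:
            f.close()

    merge_shards("shards/*.csv", "merged.csv")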
tored about 3 years ago
Put it into a database; most databases can import CSV files directly.

From there you can start creating SELECT queries, and depending on how much processing you need to do, you can create intermediate views for multiple steps. After that you can export the data directly to CSV.

https://www.mysqltutorial.org/import-csv-file-mysql-table/

https://www.mysqltutorial.org/mysql-export-table-to-csv/

https://www.sqlitetutorial.net/sqlite-import-csv/

https://www.sqlitetutorial.net/sqlite-export-csv/
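A rough illustration of the same route using Python's built-in sqlite3; the column names (key, value) and the MAX() aggregate are placeholders standing in for the real schema and merge rule:

    # Load the CSV into SQLite, "merge" rows sharing a key, export back to CSV.
    import csv
    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.execute("CREATE TABLE IF NOT EXISTS rows (key TEXT, value TEXT)")

    with open("big.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row, if there is one
        conn.executemany("INSERT INTO rows VALUES (?, ?)", reader)
    conn.commit()

    with open("merged.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["key", "value"])
        # MAX() is only a stand-in for whatever "merge" should actually do.
        for row in conn.execute("SELECT key, MAX(value) FROM rows GROUP BY key"):
            writer.writerow(row)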
Minor49er about 3 years ago
I would write a script in PHP to iterate over the CSV (though most languages would work here). In the first pass, scan for the rows that should be merged. On the second pass, when a merged row is encountered, write it if it hasn't been written yet. If you do this, just be sure to make a demo or two first to ensure that the behavior is what you expect. Also make use of buffering, since reading and writing a file line-by-line will probably be very slow compared to reading it in blocks at a time.
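The same two-pass idea, sketched in Python rather than PHP; the key column and the "keep the first row of a duplicate group" rule are assumptions standing in for the real merge logic:

    # Pass 1 counts how often each key appears; pass 2 streams the file again,
    # passing unique rows through and writing one row per repeated key.
    # Assumes the merge key is column 0 and that the set of keys fits in memory.
    import csv
    from collections import Counter

    counts = Counter()
    with open("big.csv", newline="") as f:
        for row in csv.reader(f):
            counts[row[0]] += 1

    written = set()
    with open("big.csv", newline="") as f, \
         open("merged.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(f):
            key = row[0]
            if counts[key] == 1:
                writer.writerow(row)    # unique row, pass it through
            elif key not in written:
                writer.writerow(row)    # first of a duplicate group (placeholder merge)
                written.add(key)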
modinfo about 3 years ago
I once created a simple script in NodeJS to merge two or more CSV files and remove duplicates. You might find it useful.

https://www.npmjs.com/package/merge-csv-keepass
zaik about 3 years ago
It seems like a naive approach will quickly run out of memory trying to detect duplicates. If possible, I would try to put the data into your favorite database system; those systems already embody a lot of optimization and knowledge about handling this kind of thing.
imichael about 3 years ago
How about this: for each row, compute an 8-bit hash and write the row to a file whose name is the hash value. Now you have 256 files that you can dedupe in memory (and if not, use a 10-bit hash or whatever).
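A sketch of that hash-sharding idea in Python, assuming the key being deduplicated on is column 0 (widen the hash if buckets are still too big):

    # Route every row to one of 256 bucket files by an 8-bit hash of its key;
    # each bucket is then ~1/256 of the input and can be de-duped in memory.
    import csv
    import hashlib

    NUM_BUCKETS = 256

    buckets = [open(f"bucket_{i:03d}.csv", "w", newline="") for i in range(NUM_BUCKETS)]
    writers = [csv.writer(f) for f in buckets]

    with open("big.csv", newline="") as f:
        for row in csv.reader(f):
            # All rows with the same key land in the same bucket file.
            h = hashlib.sha256(row[0].encode()).digest()[0]  # first byte = 8-bit hash
            writers[h].writerow(row)

    for f in buckets:
        f.close()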
mattewong about 3 years ago
Are you asking how to do this with a single machine, or how to do this assuming you have access to cloud resources?
yuppie_scum about 3 years ago
Stream it thru Kafka, Redis or Elasticache or something pub-sub like that and dump the resulting set?
wizwit999 about 3 years ago
Convert to Parquet and use Spark.
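A rough PySpark sketch of that route; the "key" column name and the max() aggregate are placeholders for the real schema and merge logic:

    # Read the CSV with Spark, persist it as Parquet, then merge rows per key.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-merge").getOrCreate()

    df = spark.read.csv("big.csv", header=True)       # distributed read
    df.write.parquet("rows.parquet", mode="overwrite")

    merged = (
        spark.read.parquet("rows.parquet")
        .groupBy("key")                                # placeholder key column
        .agg(*[F.max(c).alias(c) for c in df.columns if c != "key"])
    )
    merged.write.csv("merged_csv", header=True, mode="overwrite")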
IronWolve about 3 years ago
Split it into smaller files.