
Ask HN: How to Deal with CSVs?

3 points by nullbytesmatter about 3 years ago
I have a massive CSV (~1.6TB) and I need to merge rows that reference the same data. What are your go-to tools for such tasks/data?

10 comments

mattewong about 3 years ago
Assuming you want to do this on a single machine with finite memory, a feasible approach would be:

1. Split into multiple shards using e.g. `split`, where each piece is small enough to fit comfortably into memory.
2. Sort each shard. You could use something like `zsv` (sql subcommand) (https://github.com/liquidaty/zsv) or `xsv` (sort subcommand) (https://github.com/BurntSushi/xsv) for this.
3. Use mergesort to combine all shards, at the same time, into a single output table, and in the process de-dupe and/or merge. (If you have too many shards for your operating system to allow that many files to be open at once, you might need to do an intermediate merge to consolidate shards.)

For #3, I don't know of a tool that does this -- probably one exists for simple de-duping, but it may be harder if you need support for merge logic. If a tool for this does not exist, I would imagine `zsv` could quite easily be extended to handle it.
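A minimal sketch of step 3 in Python, assuming each shard is already sorted by a key in column 0 and that "merge" just means keeping the first row seen per key; the shard filenames and the merge rule are placeholders, not part of the original suggestion:

    # k-way merge of pre-sorted CSV shards with de-duplication, streaming
    # the shards so nothing has to fit fully in memory.
    import csv
    import glob
    import heapq

    def merge_rows(kept, new):
        # Placeholder merge rule: keep the first row encountered for a key.
        return kept

    def merge_shards(shard_glob, out_path):
        files = [open(p, newline="") for p in sorted(glob.glob(shard_glob))]
        readers = [csv.reader(f) for f in files]
        # heapq.merge streams the sorted shards in key order.
        merged = heapq.merge(*readers, key=lambda row: row[0])
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            current_key, current_row = None, None
            for row in merged:
                if row[0] == current_key:
                    current_row = merge_rows(current_row, row)
                else:
                    if current_row is not None:
                        writer.writerow(current_row)
                    current_key, current_row = row[0], row
            if current_row is not None:
                writer.writerow(current_row)
        for f in files:
            f.close()

    merge_shards("shards/*.csv", "merged.csv")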
tored about 3 years ago
Put it into a database; most databases can import CSV files directly.

From there you can start creating SELECT queries, and depending on how much processing you need to do, you can create intermediate views for multiple steps. After that you can export the data directly to CSV.

https://www.mysqltutorial.org/import-csv-file-mysql-table/

https://www.mysqltutorial.org/mysql-export-table-to-csv/

https://www.sqlitetutorial.net/sqlite-import-csv/

https://www.sqlitetutorial.net/sqlite-export-csv/
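A rough illustration of the same route using Python's built-in sqlite3; the column names (key, value) and the MAX() aggregate are placeholders standing in for the real schema and merge rule:

    # Load the CSV into SQLite, "merge" rows sharing a key, export back to CSV.
    import csv
    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.execute("CREATE TABLE IF NOT EXISTS rows (key TEXT, value TEXT)")

    with open("big.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row, if there is one
        conn.executemany("INSERT INTO rows VALUES (?, ?)", reader)
    conn.commit()

    with open("merged.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["key", "value"])
        # MAX() is only a stand-in for whatever "merge" should actually do.
        for row in conn.execute("SELECT key, MAX(value) FROM rows GROUP BY key"):
            writer.writerow(row)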
Minor49er about 3 years ago
I would write a script in PHP to iterate over the CSV (though most languages would work here). In the first pass, scan for the rows that should be merged. On the second pass, when a merged row is encountered, write it if it hasn't been written yet. If you do this, just be sure to make a demo or two first to ensure that the behavior is what you expect. Also make use of buffering, since reading and writing a file line-by-line will probably be very slow compared to reading it in blocks at a time.
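The same two-pass idea, sketched in Python rather than PHP; the key column and the "keep the first row of a duplicate group" rule are assumptions standing in for the real merge logic:

    # Pass 1 counts how often each key appears; pass 2 streams the file again,
    # passing unique rows through and writing one row per repeated key.
    # Assumes the merge key is column 0 and that the set of keys fits in memory.
    import csv
    from collections import Counter

    counts = Counter()
    with open("big.csv", newline="") as f:
        for row in csv.reader(f):
            counts[row[0]] += 1

    written = set()
    with open("big.csv", newline="") as f, \
         open("merged.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(f):
            key = row[0]
            if counts[key] == 1:
                writer.writerow(row)    # unique row, pass it through
            elif key not in written:
                writer.writerow(row)    # first of a duplicate group (placeholder merge)
                written.add(key)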
modinfo about 3 years ago
I once created a simple script in NodeJS to merge two or more CSV files and remove duplicates. You might find it useful.

https://www.npmjs.com/package/merge-csv-keepass
zaik about 3 years ago
It seems like a naive approach will quickly run out of memory trying to detect duplicates. If possible, I would try to put the data into your favorite database system; those systems already embody a lot of optimization and knowledge about handling this kind of thing.
imichael about 3 years ago
How about this: for each row, compute an 8-bit hash and write the row to a file whose name is the hash value. Now you have 256 files that you can dedupe in memory (and if not, use a 10-bit hash or whatever).
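A sketch of that hash-sharding idea in Python, assuming the key being deduplicated on is column 0 (widen the hash if buckets are still too big):

    # Route every row to one of 256 bucket files by an 8-bit hash of its key;
    # each bucket is then ~1/256 of the input and can be de-duped in memory.
    import csv
    import hashlib

    NUM_BUCKETS = 256

    buckets = [open(f"bucket_{i:03d}.csv", "w", newline="") for i in range(NUM_BUCKETS)]
    writers = [csv.writer(f) for f in buckets]

    with open("big.csv", newline="") as f:
        for row in csv.reader(f):
            # All rows with the same key land in the same bucket file.
            h = hashlib.sha256(row[0].encode()).digest()[0]  # first byte = 8-bit hash
            writers[h].writerow(row)

    for f in buckets:
        f.close()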
mattewong about 3 years ago
Are you asking how to do this with a single machine, or how to do this assuming you have access to cloud resources?
yuppie_scum about 3 years ago
Stream it thru Kafka, Redis or Elasticache or something pub-sub like that and dump the resulting set?
wizwit999 about 3 years ago
Convert to Parquet and use Spark.
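A rough PySpark sketch of that route; the "key" column name and the max() aggregate are placeholders for the real schema and merge logic:

    # Read the CSV with Spark, persist it as Parquet, then merge rows per key.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-merge").getOrCreate()

    df = spark.read.csv("big.csv", header=True)       # distributed read
    df.write.parquet("rows.parquet", mode="overwrite")

    merged = (
        spark.read.parquet("rows.parquet")
        .groupBy("key")                                # placeholder key column
        .agg(*[F.max(c).alias(c) for c in df.columns if c != "key"])
    )
    merged.write.csv("merged_csv", header=True, mode="overwrite")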
IronWolve about 3 years ago
Split it into smaller files.