Assuming you want to do this on a single machine with finite memory, a feasible approach would be:<p>1. Split the input into multiple shards using e.g. `split`, where each piece is small enough to fit comfortably into memory
2. Sort each shard. You could use something like `zsv` (sql sub-command) (<a href="https://github.com/liquidaty/zsv" rel="nofollow">https://github.com/liquidaty/zsv</a>) or `xsv` (sort subcommand) (<a href="https://github.com/BurntSushi/xsv" rel="nofollow">https://github.com/BurntSushi/xsv</a>) for this
3. Merge-sort all shards at the same time into a single output table, de-duping and/or merging in the process; a sketch of this step is below (if you have too many shards for your operating system to allow that many files to be open at once, you might need an intermediate merge to consolidate shards)<p>For #3, I don't know of a tool to do this-- one probably exists for simple de-duping, but it may be harder to find if you need support for merge logic. If a tool for this does not exist, I would imagine `zsv` could quite easily be extended to handle it.
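To make the merge step concrete, here is a minimal Python sketch of a k-way merge with exact-line de-duping. It assumes each shard is already sorted on the full line and has no header row; the shard file names are placeholders, and any real merge logic would go where exact duplicates are dropped:

```python
# Step 3 sketch: k-way merge of pre-sorted shards with simple de-duping.
# Assumes headerless shards, each already sorted on the whole line.
import heapq

def merge_shards(shard_paths, out_path):
    files = [open(p, newline="") for p in shard_paths]
    try:
        with open(out_path, "w", newline="") as out:
            last = None
            # heapq.merge streams lines from all shards in sorted order
            for line in heapq.merge(*files):
                if line != last:      # drop exact duplicates;
                    out.write(line)   # custom merge logic would go here
                    last = line
    finally:
        for f in files:
            f.close()

merge_shards(["shard_aa.csv", "shard_ab.csv"], "merged.csv")
```

Since `heapq.merge` only keeps one line per shard in memory at a time, this stays cheap no matter how large the total data set is.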
Put it into a database; most databases can import CSV files directly.<p>From there you can start creating SELECT queries, and depending on how much processing you need to do, you can create intermediate views for the multiple steps. After that you can export the data directly to CSV.<p><a href="https://www.mysqltutorial.org/import-csv-file-mysql-table/" rel="nofollow">https://www.mysqltutorial.org/import-csv-file-mysql-table/</a><p><a href="https://www.mysqltutorial.org/mysql-export-table-to-csv/" rel="nofollow">https://www.mysqltutorial.org/mysql-export-table-to-csv/</a><p><a href="https://www.sqlitetutorial.net/sqlite-import-csv/" rel="nofollow">https://www.sqlitetutorial.net/sqlite-import-csv/</a><p><a href="https://www.sqlitetutorial.net/sqlite-export-csv/" rel="nofollow">https://www.sqlitetutorial.net/sqlite-export-csv/</a>
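With SQLite you can do the whole round trip from a short Python script using the built-in `sqlite3` and `csv` modules. The table schema, column names, and the GROUP BY dedupe rule below are made up for illustration; adapt them to whatever your CSV actually contains:

```python
# Sketch of the database route: load the CSV into SQLite, dedupe with a
# query, write the result back out. Columns ("id", "name") are assumptions.
import csv
import sqlite3

conn = sqlite3.connect("work.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (id TEXT, name TEXT)")

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)                      # skip the header row
    conn.executemany("INSERT INTO rows VALUES (?, ?)", reader)
conn.commit()

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    # "one row per id" stands in for whatever merge/dedupe logic you need
    for row in conn.execute("SELECT id, MAX(name) FROM rows GROUP BY id"):
        writer.writerow(row)
conn.close()
```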
I would write a script in PHP to iterate over the CSV (though most languages would work here). On the first pass, scan for the rows that should be merged. On the second pass, when a merged row is encountered, write it only if it hasn't been written yet. If you do this, just be sure to make a demo or two first to ensure that the behavior is what you expect. Also make use of buffering, since reading and writing a file line by line will probably be very slow compared to reading it in blocks at a time.
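A rough sketch of that two-pass idea (in Python rather than PHP; the key column index is an assumption, and Python's file objects are block-buffered by default, which covers the buffering point):

```python
# Pass 1: count how often each key appears.
# Pass 2: write unique rows as-is and each duplicated key only once.
import csv
from collections import Counter

KEY_COL = 0  # assumed: the first column identifies rows to merge

counts = Counter()
with open("input.csv", newline="") as f:
    for row in csv.reader(f):
        counts[row[KEY_COL]] += 1

written = set()
with open("input.csv", newline="") as src, \
     open("deduped.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        key = row[KEY_COL]
        if counts[key] == 1:
            writer.writerow(row)
        elif key not in written:      # first occurrence of a duplicated key
            writer.writerow(row)      # real merge logic would go here
            written.add(key)
```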
I once created a simple script in NodeJS to merge two or more CSV files and remove duplicates. You might find it useful.<p><a href="https://www.npmjs.com/package/merge-csv-keepass" rel="nofollow">https://www.npmjs.com/package/merge-csv-keepass</a>
Seems like a naive approach will quickly run out of memory trying to detect duplicates. When possible, I would try to put the data into your favorite database system; a lot of optimization and knowledge about how to handle such things is already built into those systems.
How about this: for each row, compute an 8-bit hash and write the row to a file whose name is the hash value. Now you have 256 files that you can each dedupe in memory (and if not, use a 10-bit hash or whatever).
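A quick sketch of that partitioning trick in Python; the file names, the use of MD5's first byte as the 8-bit hash, and deduping on the whole line are all assumptions:

```python
# Pass 1: route every row to one of 256 bucket files by an 8-bit hash.
# Pass 2: dedupe each small bucket in memory; identical rows always land
# in the same bucket, so this is safe.
import hashlib

buckets = [open(f"bucket_{i:02x}.csv", "w") for i in range(256)]
with open("input.csv") as f:
    for line in f:
        h = hashlib.md5(line.encode()).digest()[0]   # first byte = 8-bit hash
        buckets[h].write(line)
for b in buckets:
    b.close()

with open("deduped.csv", "w") as out:
    for i in range(256):
        with open(f"bucket_{i:02x}.csv") as b:
            out.writelines(set(b))    # each bucket fits in memory
```

Note that the output order is not preserved, and a header row (if any) would need to be handled separately.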