Ask HN: Best way to do heavy csv processing?

1 point by laxentasken over 6 years ago
I have a couple of big CSV files (~5-10 GB, millions of rows) that need to be processed (linked to older files, data updated, etc.) and then exported to new CSV files.

The data follows the relational model, but just updating one field after dumping it into PostgreSQL takes quite some time (doing an update on a join), and I'm not sure this is the most effective tool or approach for this kind of work. The only queries that will be run are updates or inserts/appends of new data onto existing tables (e.g. the older files).

Do you have any suggestions on what to look into for a workload like this?
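For context, a minimal sketch of the pattern the question describes: bulk-load the CSV into a staging table with COPY, then do a single set-based UPDATE ... FROM. This assumes psycopg2 and hypothetical table and column names (staging_updates, main_table, id, value) that already exist with matching schemas:

```python
import psycopg2

# Hypothetical connection string and schema, purely for illustration.
conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# Bulk-load the CSV into a staging table with COPY, which is far faster
# than row-by-row INSERTs for multi-GB files.
with open("updates.csv") as f:
    cur.copy_expert(
        "COPY staging_updates FROM STDIN WITH (FORMAT csv, HEADER true)", f
    )

# One set-based UPDATE joining the staging table against the main table,
# instead of updating rows one at a time.
cur.execute(
    """
    UPDATE main_table AS m
       SET value = s.value
      FROM staging_updates AS s
     WHERE m.id = s.id
    """
)
conn.commit()
cur.close()
conn.close()
```

If a join-based update like this is still slow, indexing the join key on both tables is the usual next thing to check.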

3 comments

cypherdtraitor over 6 years ago
I have done this several times now. CSV just plain sucks.

1. Rip all the CSV data into SQLite or another tabular database.

2. Do all data manipulations by shifting information between the database and memory. Ideally you pull entire columns at a time; 95% of your runtime is going to be spent pulling and pushing data, so minimize the number of calls however possible.

3. Export the database back to CSV and ship it to the customer.

If you use a particular language a lot, it is worth writing a text scanner that uses low-level APIs to read large CSV files quickly. I usually read a million characters at a time, submit most of them to the database, then duct-tape the last few characters onto the next million that I pull.
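As an illustration of steps 1 and 2 above, here is a minimal sketch of batched ingestion into SQLite using only Python's standard library. The file name and two-column schema are hypothetical stand-ins for the real data:

```python
import csv
import sqlite3

# Hypothetical file and schema names, purely for illustration.
conn = sqlite3.connect("work.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (id TEXT, value TEXT)")

def batches(reader, size=100_000):
    """Yield rows in large batches so the database sees few, big calls."""
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

with open("big.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for batch in batches(reader):
        conn.executemany("INSERT INTO rows VALUES (?, ?)", batch)

conn.commit()
conn.close()
```

The point is the batching: one executemany call per hundred thousand rows rather than one call per row, since the round trips to the database dominate the runtime.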
jondegenhardt over 6 years ago
Not a specific approach to your problem, but a resource that may be useful: https://github.com/dbohdan/structured-text-tools. It lists an extensive set of command-line tools useful for working with this class of files.
geophile over 6 years ago
This is not a lot of data. Without trying hard, you should be able to import the data within an hour. After that, it's basic Postgres optimization. But it's hard to offer advice with such a vague description of the update you are attempting.
[Comment #18269065 not loaded]