Ask HN: Best way to do heavy csv processing?

1 point by laxentasken over 6 years ago
I've got a couple of big CSV files (~5-10 GB, with millions of rows) that need to be processed (linked to older files, data updated, etc.) and then exported to new CSV files.

The data follows the relational model, but just updating one field after dumping it into PostgreSQL takes quite some time (doing an update on a join), and I'm not sure this is the most effective tool or approach for this kind of work. The only queries that will be run are updates and inserts/appends of new data to existing tables (e.g. older files).

Do you have any suggestions for what to look into for a workload like this?
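(For reference, the "update on join" described above typically looks like the sketch below; this is a minimal illustration using psycopg2 against PostgreSQL, with hypothetical table and column names, and an index on the join key, which is often the first thing to check when such an update is slow.)

```python
# Minimal sketch of the "update on join" pattern; `staging`, `main`, `id`
# and `price` are hypothetical stand-ins for the poster's actual schema.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # assumed connection string
with conn, conn.cursor() as cur:
    # Index the join key first; without it the update has to scan both tables.
    cur.execute("CREATE INDEX IF NOT EXISTS staging_id_idx ON staging (id)")
    # Update rows in the main table from a freshly loaded staging table.
    cur.execute("""
        UPDATE main
           SET price = s.price
          FROM staging AS s
         WHERE main.id = s.id
    """)
conn.close()
```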

3 comments

cypherdtraitor over 6 years ago
I have done this several times now. CSV just plain sucks.

1. Rip all CSV data into SQLite or another tabular database.

2. Do all data manipulations by shifting information between the database and memory. Ideally you pull entire columns at a time. 95% of your runtime is going to be spent pulling and pushing data, so minimize the number of calls however possible.

3. Export the database to CSV and ship it back to the customer.

If you use a particular language a lot, it is worth writing a text scanner that uses low-level APIs to read large CSV files quickly. I usually pipe a million characters at a time, submit most of them to the database, then duct-tape the last few characters to the next million that I pull.
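(A rough Python sketch of the three steps above, using only the standard-library csv and sqlite3 modules; the file names, the id/value table layout, and the value transformation are placeholders, not part of the comment.)

```python
import csv
import sqlite3

conn = sqlite3.connect("work.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (id TEXT, value REAL)")

# 1. Rip the CSV into SQLite in large batches to keep call overhead low.
with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= 100_000:
            conn.executemany("INSERT INTO rows VALUES (?, ?)", batch)
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO rows VALUES (?, ?)", batch)
conn.commit()

# 2. Pull whole columns at once, transform in memory, push the results back.
ids, values = zip(*conn.execute("SELECT id, value FROM rows"))
updated = [(v * 1.1, i) for i, v in zip(ids, values)]  # placeholder update
conn.executemany("UPDATE rows SET value = ? WHERE id = ?", updated)
conn.commit()

# 3. Export the database back to CSV for the customer.
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    writer.writerows(conn.execute("SELECT id, value FROM rows"))
conn.close()
```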
jondegenhardt over 6 years ago
Not a specific approach to your problem, but a resource that may be useful is https://github.com/dbohdan/structured-text-tools. It lists an extensive set of command line tools useful for working with this class of files.
geophile over 6 years ago
This is not a lot of data. Without trying hard, you should be able to import the data within an hour. After that, it's basic Postgres optimization. But it's hard to offer advice with such a vague description of the update you are attempting.
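(For the import step, PostgreSQL's COPY is the usual fast path; below is a minimal sketch using psycopg2's copy_expert, with a hypothetical staging table and file name.)

```python
# Minimal sketch of a fast bulk load into PostgreSQL via COPY; the table
# layout and file name are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS staging (id TEXT, price REAL)")
    with open("big_file.csv") as f:
        # COPY streams the whole file in one round trip instead of issuing
        # millions of individual INSERT statements.
        cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.close()
```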
Comment #18269065 not loaded.