Processing large files, line by line

22 points by rayvega almost 12 years ago

3 comments

wting almost 12 years ago
This is an excessively long blog post that basically states: do stream processing when your data set doesn't fit into memory.

    with open('in.txt') as infile, open('out.txt', 'w') as out:
        # Iterate the file object directly instead of calling
        # readlines(), which would read the whole file into memory.
        for line in infile:
            out.write(foo(line))

Python users are used to reading everything into memory all at once, while in C everything is done in small chunks whenever possible.

Python 3 is also moving in this direction by replacing many default functions with their iterator equivalents (map, range, etc.).

You might think that this means forcing everything into one big context manager, but that's not necessarily true. For example:

    import csv
    from itertools import imap  # Python 2; in Python 3, map() is already lazy

    def read_file(filename):
        with open(filename, 'r') as f:
            reader = csv.reader(f)
            for line in reader:
                yield line

    def write_file(filename, data):
        with open(filename, 'w') as f:
            writer = csv.writer(f)
            map(writer.writerow, data)

    write_file(
        filename='out.txt',
        data=imap(foo, read_file('in.txt')))
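Since the comment's example is Python 2 (note `itertools.imap`), here is a minimal sketch of the same pipeline in Python 3, where `map()` is already lazy; `foo` is the same stand-in transform as above, defined trivially so the snippet runs, and the file names are carried over from the comment:

    import csv

    def foo(row):
        # Stand-in transform, as in the comment above.
        return row

    def read_file(filename):
        # Lazily yield CSV rows; the file is open only while iterating.
        with open(filename, newline='') as f:
            yield from csv.reader(f)

    def write_file(filename, rows):
        with open(filename, 'w', newline='') as f:
            csv.writer(f).writerows(rows)

    # map() returns a lazy iterator in Python 3, so rows are
    # read, transformed, and written one at a time.
    write_file('out.txt', map(foo, read_file('in.txt')))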
csense almost 12 years ago
There was an article a few weeks ago on the front page of HN about an interview question for data scientists that was essentially the "exact-split" problem mentioned at the end of the article. The article (or maybe it was the comment thread) showed an algorithm to randomly split a file of size m+n into disjoint sublists of size m and n, using a single pass through the data and O(n) memory.

This blog post's algorithm accomplishes the same task with two passes and O(m+n) space. It seems odd that an article explicitly about encouraging readers to think like the authors of UNIX and make simple reusable utilities that process stream data would use a two-pass algorithm when a fairly simple one-pass algorithm is available.
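The one-pass algorithm csense mentions is not quoted in the thread, but selection sampling (Knuth's Algorithm S) matches the description: a single pass that keeps exactly n randomly chosen lines in O(n) memory and streams the remaining m lines straight to an output. A minimal sketch under that assumption, with illustrative names:

    import random

    def exact_split(lines, n, total, large_out):
        # Selection sampling: each line enters the size-n sublist with
        # probability (slots still needed) / (lines still remaining),
        # giving a uniformly random exact split in a single pass.
        small = []
        needed, remaining = n, total
        for line in lines:
            if random.random() * remaining < needed:
                small.append(line)       # at most n lines buffered
                needed -= 1
            else:
                large_out.write(line)    # streamed, never buffered
            remaining -= 1
        return small

    # Usage: split a 1,000,000-line file into a random 1,000-line
    # sample (in memory) and the remaining lines (on disk).
    with open('in.txt') as src, open('rest.txt', 'w') as dst:
        sample = exact_split(src, n=1000, total=1000000, large_out=dst)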
walshemj almost 12 years ago
Give me strength. Big data is NOT when it's too large to fit in primary memory.