Processing large files, line by line

22 points by rayvega almost 12 years ago

3 comments

wting almost 12 years ago
This is an excessively long blog post that basically states: do stream processing when your data set doesn't fit into memory.

    with open('in.txt') as infile, open('out.txt', 'w') as out:
        # Iterate the file object directly instead of calling
        # readlines(), which would read the whole file into memory.
        for line in infile:
            out.write(foo(line))

Python users are used to reading everything into memory all at once, while in C everything is done in small chunks whenever possible.

Python 3 is also moving in this direction by replacing many default functions with their iterator equivalents (map, range, etc.).

You might think that this means forcing everything into one big context manager, but that's not necessarily true. For example:

    import csv
    from itertools import imap  # Python 2; in Python 3, map() is already lazy

    def read_file(filename):
        with open(filename, 'r') as f:
            reader = csv.reader(f)
            for line in reader:
                yield line

    def write_file(filename, data):
        with open(filename, 'w') as f:
            writer = csv.writer(f)
            map(writer.writerow, data)

    write_file(
        filename='out.txt',
        data=imap(foo, read_file('in.txt')))
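Since the comment's example is Python 2 (note `itertools.imap`), here is a minimal sketch of the same pipeline in Python 3, where `map()` is already lazy; `foo` is the same stand-in transform as above, defined trivially so the snippet runs, and the file names are carried over from the comment:

    import csv

    def foo(row):
        # Stand-in transform, as in the comment above.
        return row

    def read_file(filename):
        # Lazily yield CSV rows; the file is open only while iterating.
        with open(filename, newline='') as f:
            yield from csv.reader(f)

    def write_file(filename, rows):
        with open(filename, 'w', newline='') as f:
            csv.writer(f).writerows(rows)

    # map() returns a lazy iterator in Python 3, so rows are
    # read, transformed, and written one at a time.
    write_file('out.txt', map(foo, read_file('in.txt')))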
csense almost 12 years ago
There was an article a few weeks ago on the front page of HN about an interview question for data scientists that was essentially the "exact-split" problem mentioned at the end of the article. The article (or maybe it was the comment thread) showed an algorithm to randomly split a file of size m+n into disjoint sublists of size m and n, using a single pass through the data and O(n) memory.

This blog post's algorithm accomplishes the same task with two passes and O(m+n) space. It seems odd that an article explicitly about encouraging readers to think like the authors of UNIX and make simple reusable utilities that process stream data would use a two-pass algorithm when a fairly simple one-pass algorithm is available.
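The one-pass algorithm csense mentions is not quoted in the thread, but selection sampling (Knuth's Algorithm S) matches the description: a single pass that keeps exactly n randomly chosen lines in O(n) memory and streams the remaining m lines straight to an output. A minimal sketch under that assumption, with illustrative names:

    import random

    def exact_split(lines, n, total, large_out):
        # Selection sampling: each line enters the size-n sublist with
        # probability (slots still needed) / (lines still remaining),
        # giving a uniformly random exact split in a single pass.
        small = []
        needed, remaining = n, total
        for line in lines:
            if random.random() * remaining < needed:
                small.append(line)       # at most n lines buffered
                needed -= 1
            else:
                large_out.write(line)    # streamed, never buffered
            remaining -= 1
        return small

    # Usage: split a 1,000,000-line file into a random 1,000-line
    # sample (in memory) and the remaining lines (on disk).
    with open('in.txt') as src, open('rest.txt', 'w') as dst:
        sample = exact_split(src, n=1000, total=1000000, large_out=dst)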
walshemj almost 12 years ago
Give me strength. Big data is NOT when it's too large to fit in primary memory.