This is awesome. One of the reasons I shifted away from Pandas is its difficulty in dealing with out-of-core data. Can't wait to try this out.
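For what it's worth, a minimal sketch of the kind of out-of-core work I mean, assuming the dask.dataframe API (the file glob and column name here are made up): Dask splits the CSVs into partitions and runs pandas on each, so the whole dataset never has to fit in memory at once.

    import dask.dataframe as dd

    # Lazy, partitioned dataframe over files that may be larger than RAM
    df = dd.read_csv('logs/2015-*.csv')

    # Familiar pandas-style groupby; nothing is read yet
    counts = df.groupby('status').size()

    # compute() streams the partitions through and collects the result
    print(counts.compute())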
Comparison of PySpark vs Dask: http://dask.pydata.org/en/latest/spark.html

Their conclusion is interesting:

    If you have a terabyte or less of CSV or JSON data
    then you should forget both Spark and Dask and use
    Postgres or MongoDB.
I don't really have "big data" problems, I have "annoying data" problems: 500G of 10:1-compressed CSV log files that I want to run reports on every now and then. Often it's just a count or top-k by a column, but sometimes grouping and counting (i.e., sum of column 5 grouped by column 3 where column 2='foo'), as in the sketch below.

I've been looking into tools like Spark and Drill, but my tests on a single machine found them to be extremely slow. Maybe things would be faster if I converted the log files to their native formats?

I've been considering loading the data into a Postgres db using cstore_fdw, but what I really want is a high-performance SQL engine for flat files, something like Kdb.

Like this article I read recently (http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html), I know this can be done efficiently enough on a single machine; I just need the right software.
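For reference, a rough single-machine sketch of that "sum of column 5 grouped by column 3 where column 2='foo'" query in plain pandas, reading the gzip'd CSVs in chunks. The file glob, column names, and chunk size are placeholders (real log files may need header=None and positional names), and it assumes a pandas version that can read compressed files in chunks; it only shows the shape of the approach, not a tuned benchmark.

    import glob
    import pandas as pd

    totals = {}
    for path in glob.glob('logs/*.csv.gz'):          # placeholder glob
        # pandas streams gzip'd CSV directly; chunksize keeps memory bounded
        for chunk in pd.read_csv(path, compression='gzip', chunksize=1000000):
            hit = chunk[chunk['col2'] == 'foo']      # where column 2 = 'foo'
            partial = hit.groupby('col3')['col5'].sum().to_dict()
            for key, val in partial.items():
                totals[key] = totals.get(key, 0) + val

    # Grouped sums, largest first (doubles as a crude top-k)
    for key, val in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(key, val)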