This is awesome. One of the reasons I shifted away from Pandas is its difficulty in dealing with out-of-core data. Can't wait to try this out.
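For what it's worth, a minimal sketch of the kind of out-of-core work I mean, assuming the dask.dataframe API (the file glob and column name here are made up): Dask splits the CSVs into partitions and runs pandas on each, so the whole dataset never has to fit in memory at once.

    import dask.dataframe as dd

    # Lazy, partitioned dataframe over files that may be larger than RAM
    df = dd.read_csv('logs/2015-*.csv')

    # Familiar pandas-style groupby; nothing is read yet
    counts = df.groupby('status').size()

    # compute() streams the partitions through and collects the result
    print(counts.compute())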
Comparison of PySpark vs Dask: http://dask.pydata.org/en/latest/spark.html

Their conclusion is interesting:

    If you have a terabyte or less of CSV or JSON data
    then you should forget both Spark and Dask and use
    Postgres or MongoDB.
I don't really have "big data" problems, I have "annoying data" problems: 500G of 10:1-compressed CSV log files that I want to run reports on every now and then. Often it's just a count or top-k by a column, but sometimes grouping and counting (i.e., sum of column 5 grouped by column 3 where column 2='foo'), as in the sketch below.

I've been looking into tools like Spark and Drill, but my tests on a single machine found them to be extremely slow. Maybe things would be faster if I converted the log files to their native formats?

I've been considering loading the data into a Postgres db using cstore_fdw, but what I really want is a high-performance SQL engine for flat files, something like Kdb.

Like this article I read recently (http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html), I know this can be done efficiently enough on a single machine; I just need the right software.
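For reference, a rough single-machine sketch of that "sum of column 5 grouped by column 3 where column 2='foo'" query in plain pandas, reading the gzip'd CSVs in chunks. The file glob, column names, and chunk size are placeholders (real log files may need header=None and positional names), and it assumes a pandas version that can read compressed files in chunks; it only shows the shape of the approach, not a tuned benchmark.

    import glob
    import pandas as pd

    totals = {}
    for path in glob.glob('logs/*.csv.gz'):          # placeholder glob
        # pandas streams gzip'd CSV directly; chunksize keeps memory bounded
        for chunk in pd.read_csv(path, compression='gzip', chunksize=1000000):
            hit = chunk[chunk['col2'] == 'foo']      # where column 2 = 'foo'
            partial = hit.groupby('col3')['col5'].sum().to_dict()
            for key, val in partial.items():
                totals[key] = totals.get(key, 0) + val

    # Grouped sums, largest first (doubles as a crude top-k)
    for key, val in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(key, val)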