Parallel Out-Of-Core Dataframes in Python: Dask and OpenStreetMap

48 points by Lofkin almost 10 years ago

3 comments

IndianAstronaut almost 10 years ago
This is awesome. One of the reasons I shifted away from Pandas is its difficulty in dealing with out-of-core data. Can't wait to try this out.
bjlkeng almost 10 years ago
Comparison of PySpark vs Dask:

http://dask.pydata.org/en/latest/spark.html
justinsaccount almost 10 years ago
Their conclusion is interesting:

    If you have a terabyte or less of CSV or JSON data then you should forget both Spark and Dask and use Postgres or MongoDB.

I don't really have "big data" problems. I have "annoying data" problems: 500G of 10:1 compressed CSV log files that I want to run reports on every now and then. Often it's just a count or top-k by a column, but sometimes grouping plus counting (e.g., sum of column 5 grouped by column 3 where column 2='foo').

I've been looking into tools like Spark and Drill, but my tests running on a single machine found them to be extremely slow. Maybe things would be faster if I converted the log files to their native formats?

I've been considering trying to load the data into a Postgres db using cstore_fdw, but what I really want is a high-performance SQL engine for flat files, probably something like Kdb.

As in this article that I read recently: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html I know this can be done efficiently enough on a single machine; I just need the right software.
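For what it's worth, the report described above (sum of column 5 grouped by column 3 where column 2='foo') can be sketched as a single streaming pass using only the Python standard library. This is a rough illustration, not a benchmarked solution: the function name `grouped_sum`, the headerless comma-separated layout, and the gzip compression are all assumptions, and the 1-based column numbers from the comment are mapped to 0-based indexes.

```python
import csv
import gzip
from collections import defaultdict


def grouped_sum(paths, filter_col=1, filter_value="foo", group_col=2, sum_col=4):
    """Stream gzip-compressed CSV files once, keeping only per-group totals in memory.

    Hypothetical helper: column indexes are 0-based, so the defaults correspond
    to "sum of column 5 grouped by column 3 where column 2 = 'foo'".
    """
    totals = defaultdict(float)
    needed = max(filter_col, group_col, sum_col)
    for path in paths:
        # "rt" decompresses on the fly and yields text, so the full file
        # is never held in memory.
        with gzip.open(path, "rt", newline="") as handle:
            for row in csv.reader(handle):
                if len(row) <= needed:
                    continue  # skip truncated rows
                if row[filter_col] != filter_value:
                    continue
                try:
                    value = float(row[sum_col])
                except ValueError:
                    continue  # skip rows whose value is not numeric
                totals[row[group_col]] += value
    return dict(totals)
```

Memory use scales with the number of distinct groups rather than the file size, which is the property that makes this kind of query feasible on one machine. Calling `grouped_sum(["a.csv.gz", "b.csv.gz"])` (file names are placeholders) returns a plain dict of group to total.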