We need a better decomposition of scalability. Do you mean scalability in data, scalability in compute, or both?<p>Definitions:<p>Scalability in data (SD): doing fast computation on a <i>very</i> large number of rows<p>Scalability in compute (SC): doing slow computations on a large number of rows<p>For SD, I have found that a 16-32 core machine is more than enough for tens of billions of rows, as long as your disk access is relatively fast (SSD vs. HDD). If you vectorize your compute operations you can typically get to within 10x of hand-tuned assembly, which lets a 32-core machine deliver tens of effective gigaflops (these machines are rated at hundreds of gigaflops). For example, I had to compute a metric on a 100-million-row table (dataframe) which required on the order of 10-20 teraflops of compute. Single-core Pandas was showing us 2 months of compute time. Using vectorization and mp.Pool I was able to reduce that to a few hours. The big win here was vectorization, not mp.Pool.<p>For SC - e.g. running multiple machine learning models which cannot be effectively limited to a single machine - nothing beats Dask. Dask is extremely mature, has seen a large number of real-world cases, and people have run it for hundreds of hours of uptime.<p>Vectorization is an oft-overlooked source of speedup which can easily give you 10-100x in Pandas. Understanding vectorization and what it can and cannot do is a highly productive exercise.
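<p>To make the pattern concrete, here is a minimal sketch of the row-wise vs. vectorized vs. vectorized-plus-Pool versions. The column names and the metric are made up for illustration; the real one was more involved.<p><pre><code>import numpy as np
import pandas as pd
from multiprocessing import Pool

def rowwise(df):
    # Row-by-row apply: the pattern that projected to ~2 months at scale.
    return df.apply(lambda r: np.sqrt(r["a"] ** 2 + r["b"] ** 2), axis=1)

def vectorized(df):
    # The same computation on whole columns: the loop happens in C inside NumPy.
    return np.sqrt(df["a"].to_numpy() ** 2 + df["b"].to_numpy() ** 2)

def parallel_vectorized(df, n_workers=32):
    # Chunk the frame and run the already-vectorized kernel on every core.
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    chunks = [df.iloc[bounds[i]:bounds[i + 1]] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        return np.concatenate(pool.map(vectorized, chunks))

if __name__ == "__main__":
    df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])
    result = parallel_vectorized(df)
</code></pre><p>The jump from rowwise to vectorized is where most of the 10-100x lives; the Pool on top is a comparatively small multiplier.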
Oh, I wrote this :) I submitted it last week but it didn't get much attention then.<p>Happy to answer questions as far as possible. I have used Pandas extensively but I don't have deep experience with all of these libraries so I learnt a lot while summarising them.<p>If you know more than I do and I made any mistakes, let me know and I'll get them corrected.
Articles like these are interesting, but what surprises me is that they rarely set up a holistic use case, so most debates imagine how long it would take an expert user to use each tool. But time constraints (e.g. time spent coding) separate expert from novice performance in many domains.<p>FWIW I have 2020 set aside to implement siuba, a python port of the popular R library dplyr (siuba runs on top of pandas, but can also generate SQL). A huge source of inspiration has been screencasts by Dave Robinson using R to analyze data he's never seen before, lightning fast.<p>Has anyone seen similar screencasts with pandas? I suspect it's not possible (given some constraints on its interface), but would love to be wrong here, because I'd like to keep all my work in python :o.<p>Expert R screencasts: <a href="https://youtu.be/NY0-IFet5AM" rel="nofollow">https://youtu.be/NY0-IFet5AM</a><p>Siuba: <a href="https://github.com/machow/siuba" rel="nofollow">https://github.com/machow/siuba</a>
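<p>For a flavour of the dplyr-style interface siuba is going for, a group-by-and-summarize is roughly this (simplified from the README):<p><pre><code>from siuba import _, group_by, summarize
from siuba.data import mtcars

# dplyr-style pipe: group cars by cylinder count, take mean horsepower.
(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp=_.hp.mean()))
</code></pre><p>The same pipe can also target a SQL table instead of a DataFrame, which is where the SQL generation comes in.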
The subtitle is "How can you process more data quicker?"<p>NumPy. It scores an A in Maturity and Popularity, and either an A or a B in Ease of Adoption depending on which Pandas features you rely on (e.g. GroupBy).<p>When you're using NumPy as the main show instead of as an implementation detail inside Pandas, it is easier to adopt Numba or Cython, and there are huge gains to be made there. Most Pandas workloads on small clusters of, say, 10 machines or fewer could be implemented on a single machine.<p>Even simple operations on smallish data sets are often much faster in NumPy than in Pandas.<p>You don't have to leave Pandas behind - just try using NumPy and Numba for the hot parts of your code. Numba even lets you write Python code that runs with the GIL released, which can give near-linear speedup in the number of cores with much less work than multiprocessing, and without the overhead of copying data to multiple processes.
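<p>As a rough sketch of that workflow (the trailing mean here is just a stand-in for your real hot loop):<p><pre><code>import numpy as np
import pandas as pd
from numba import njit, prange

# parallel=True splits the prange across cores; nogil=True means the compiled
# code also cooperates with thread pools instead of serialising on the GIL.
@njit(parallel=True, nogil=True)
def trailing_mean(x, window):
    out = np.empty(x.size)
    for i in prange(x.size):
        lo = max(0, i - window + 1)
        out[i] = x[lo:i + 1].mean()
    return out

df = pd.DataFrame({"price": np.random.rand(5_000_000)})
# Hand Numba the underlying NumPy array, then put the result back into Pandas.
df["trailing"] = trailing_mean(df["price"].to_numpy(), 50)
</code></pre><p>The first call pays a one-off JIT compilation cost; after that the loop runs as native code on all cores.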
The tie-breaker here really is Kubernetes.
Most likely your company's infrastructure runs on k8s, and as a data scientist you don't get control over that.<p>Dask natively integrates with Kubernetes. That's why I see a lot of people moving away even from Apache Spark (which is typically deployed on Hadoop's YARN resource manager rather than on k8s) and towards Dask.<p>The second reason is that the dask-ml project is building seamless compatibility for higher-level ML libraries (sklearn, etc.) on top of Dask, not just NumPy/Pandas.
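<p>A minimal sketch of that combination, assuming dask-kubernetes is installed and you already have a worker pod spec for your cluster (the yaml filename and the KubeCluster call are placeholders and vary by environment and version):<p><pre><code>from dask_kubernetes import KubeCluster      # classic dask-kubernetes API
from dask.distributed import Client
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Spin up Dask workers as pods on the k8s cluster you already have.
cluster = KubeCluster.from_yaml("worker-spec.yaml")   # your pod spec
cluster.scale(20)
client = Client(cluster)

X, y = make_classification(n_samples=100_000, n_features=40)
search = GridSearchCV(RandomForestClassifier(),
                      {"n_estimators": [100, 300], "max_depth": [5, 10]})

# Route scikit-learn's internal parallelism through the Dask cluster.
with joblib.parallel_backend("dask"):
    search.fit(X, y)
</code></pre><p>dask-ml goes further with drop-in estimators (e.g. dask_ml.model_selection.GridSearchCV) that understand Dask collections directly.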
I'm working on a project called CloudPy that's a one-line drop-in for Pandas to process 100 GB+ dataframes.<p>It works by proxying the dataframe to a remote server, which can have a lot more memory than your local machine. The project is in beta right now, but please reach out if you're interested in trying it out! You can read more at <a href="http://www.cloudpy.io/" rel="nofollow">http://www.cloudpy.io/</a> or email hello@cloudpy.io
<a href="https://github.com/weld-project/weld" rel="nofollow">https://github.com/weld-project/weld</a> - this is in the same domain as Vaex, I suppose, though the scope is much larger.