Every time I build something complex with dataframes in either R or Python (Pandas; I haven't used Polars yet), I end up really wishing I could have statically typed dataframes. I miss the security of knowing that when I change common code, the compiler will catch it if I break a totally different part of the dashboard, for instance.

I'm aware of Pandera[1], which supports Polars as well, but while nice, it doesn't cause the code to fail to compile; it only fails at runtime. To me this is the Achilles' heel of analysis in both Python and R.

Does anybody have ideas on how this situation could be improved?

[1] https://pandera.readthedocs.io/en/stable/
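To make the runtime-only point concrete, here is a minimal sketch using Pandera with pandas (the schema and function names are just illustrative): a static type checker happily accepts the call, and the bad data is only caught when the code actually runs.

    import pandas as pd
    import pandera as pa
    from pandera.typing import DataFrame, Series

    class Trades(pa.DataFrameModel):
        ticker: Series[str]
        price: Series[float] = pa.Field(gt=0)  # must be positive

    @pa.check_types
    def total_value(df: DataFrame[Trades]) -> float:
        return float(df["price"].sum())

    # mypy/pyright see nothing wrong with this call; the negative price
    # only surfaces at runtime as a SchemaError, never at "compile" time.
    total_value(pd.DataFrame({"ticker": ["A"], "price": [-1.0]}))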
This is very impressive and definitely fills a huge hole in the dataframe ecosystem.

I've been quite impressed with the Polars team, and after using Pandas for years, Polars feels like a much-needed breath of fresh air. Very excited to give this a go sometime soon!
Polars seems cool, but I'm not willing to invest in adopting it until Geo support is more mature. I find I prefer to run most of the operations I'd use dataframe libraries for in local SQL via DuckDB anyway.
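For what it's worth, this is roughly what that workflow looks like (a small sketch; the data and column names are made up). DuckDB can query an in-memory dataframe directly by variable name and hand the result back as Polars:

    import duckdb
    import polars as pl

    df = pl.DataFrame({"city": ["NYC", "NYC", "LA"], "sales": [10, 20, 5]})

    # DuckDB's replacement scan picks up the local variable `df` by name.
    result = duckdb.sql("""
        SELECT city, SUM(sales) AS total
        FROM df
        GROUP BY city
        ORDER BY total DESC
    """).pl()
    print(result)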
Love it! Competition for Databricks is always appreciated, and I think having a competitor that is not running on the JVM is amazing. Working with Polars always feels insanely lightweight compared to Spark. If you provided workflows/scheduling out of the box, I would migrate my Spark jobs today :)
This is really cool; not sure how I missed it. I assume catalog support will be added fairly quickly.

But ironically, I think the biggest barrier to adoption will be the lack of an off-ramp to a FOSS solution that companies can self-host. Obviously Polars itself is FOSS, but it understandably seems like there's no way to self-host a backend to point a `pc.ComputeContext` at. That will be an especially tough sell for companies that are already on Spark. I wonder how much they'll focus on startups vs. trying to get bigger companies to switch, and whether they'll try a Spark compatibility layer like DataFusion Comet (https://github.com/apache/datafusion-comet).
This is very interesting; clearly there's a major pain point here to be addressed, especially the delta between local pandas work and distributed [PySpark] work!

Would love to test this out and run benchmarks against us, Dask, Spark, Ray, etc., which have been our primary testing ground. Full disclosure: I work at Bodo, which has similar-ish aspirations (https://github.com/bodo-ai/Bodo), but FOSS all the way.
As a hobbyist, I describe Polars as what pandas would be if it had been designed for humans to use. It's great to use; I just hate running into issues while trying to use it. I wish them luck.
Really excited for the Polars team. I've always been impressed by their work and responsiveness to issues I've filed in the past. The world is lifted when there is good competition like this.
How does this integrate with existing services like AWS Glue? I fear that despite Polars being as good or better, it will lack adoption since it cannot easily be integrated.
I just got into data analysis recently (former software engineer) and tried out pandas vs. Polars. I like Polars way more because it feels like SQL, but sane, and it's faster. It's clear about what it's trying to do. I never really had that feeling with pandas.
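A trivial example of what I mean (made-up data, but the API calls are real): the Polars version reads almost like the SQL you would have written.

    import polars as pl

    df = pl.DataFrame({
        "dept": ["eng", "eng", "sales"],
        "salary": [100, 120, 90],
    })

    # Roughly: SELECT dept, AVG(salary) AS avg_salary FROM df
    #          GROUP BY dept ORDER BY avg_salary DESC
    summary = (
        df.group_by("dept")
          .agg(pl.col("salary").mean().alias("avg_salary"))
          .sort("avg_salary", descending=True)
    )
    print(summary)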
I can appreciate the pain points you guys are addressing.

The "diagonal scaling" approach seems particularly clever - dynamically choosing between horizontal and vertical scaling based on the query characteristics rather than forcing users into a one-size-fits-all model. Most real-world data workloads have mixed requirements, so this flexibility could be a major advantage.

I'm curious how the new streaming engine with out-of-core processing will compare to Dask, which has been in this space for a while but hasn't quite achieved the adoption of pandas/PySpark despite its strengths.

The unified API approach also tackles a real issue. The cognitive overhead of switching between pandas for local work and PySpark for distributed work is higher than most people acknowledge. Having a consistent mental model regardless of scale would be a productivity boost.

Anyway, I would love to apply for early access and try it out. I'd be particularly interested in seeing benchmark comparisons against Ray, Dask, and Spark for different workload profiles. Also curious about the pricing model and the cold start problem that plagues many distributed systems.
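On the unified API point, a small sketch of the mental model I'm hoping for: the local half uses today's lazy API, while the remote dispatch lines are purely hypothetical and only illustrate the idea of sending the same plan to managed compute instead of rewriting it in PySpark.

    import polars as pl
    # import polars_cloud as pc  # hypothetical import, names below are illustrative only

    # One lazy query definition, regardless of where it eventually runs.
    query = (
        pl.scan_parquet("events/*.parquet")
          .filter(pl.col("country") == "NL")
          .group_by("user_id")
          .agg(pl.col("amount").sum().alias("total"))
    )

    # Local: collect on my laptop (the new streaming engine would let this
    # spill out-of-core for larger-than-memory data).
    local_result = query.collect()

    # Remote (assumption, not the real API): hand the identical plan to a cluster.
    # ctx = pc.ComputeContext(cpus=16, memory=64)
    # remote_result = query.remote(ctx).collect()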