Every time I build something complex with dataframes in either R or Python (Pandas; I haven't used Polars yet), I end up really wishing I could have statically typed dataframes. I miss the security of knowing that when I change common code, the compiler will catch it if I break a totally different part of the dashboard, for instance.

I'm aware of Pandera[1], which supports Polars as well, but while nice, it doesn't cause the code to fail to compile; it only fails at runtime. To me this is the Achilles' heel of analysis in both Python and R.

Does anybody have ideas on how this situation could be improved?

[1] https://pandera.readthedocs.io/en/stable/
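To make the runtime-only point concrete, here is a minimal sketch using Pandera with pandas (the schema and function names are just illustrative): a static type checker happily accepts the call, and the bad data is only caught when the code actually runs.

    import pandas as pd
    import pandera as pa
    from pandera.typing import DataFrame, Series

    class Trades(pa.DataFrameModel):
        ticker: Series[str]
        price: Series[float] = pa.Field(gt=0)  # must be positive

    @pa.check_types
    def total_value(df: DataFrame[Trades]) -> float:
        return float(df["price"].sum())

    # mypy/pyright see nothing wrong with this call; the negative price
    # only surfaces at runtime as a SchemaError, never at "compile" time.
    total_value(pd.DataFrame({"ticker": ["A"], "price": [-1.0]}))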
This is very impressive and definitely fills a huge hole in the dataframe ecosystem.

I've been quite impressed with the Polars team, and after using Pandas for years, Polars feels like a much-needed breath of fresh air. Very excited to give this a go sometime soon!
Polars seems cool, but I'm not willing to invest in adopting it until Geo support is more mature. I find I prefer to run most of the operations I'd use dataframe libraries for in local SQL via DuckDB anyway.
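For what it's worth, this is roughly what that workflow looks like (a small sketch; the data and column names are made up). DuckDB can query an in-memory dataframe directly by variable name and hand the result back as Polars:

    import duckdb
    import polars as pl

    df = pl.DataFrame({"city": ["NYC", "NYC", "LA"], "sales": [10, 20, 5]})

    # DuckDB's replacement scan picks up the local variable `df` by name.
    result = duckdb.sql("""
        SELECT city, SUM(sales) AS total
        FROM df
        GROUP BY city
        ORDER BY total DESC
    """).pl()
    print(result)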
Love it! Competition for Databricks is always appreciated, and I think having a competitor that is not running on the JVM is amazing. Working with Polars always feels insanely lightweight compared to Spark. If you provided workflows/scheduling out of the box, I would migrate my Spark jobs today :)
This is really cool; not sure how I missed it. I assume catalog support will be added fairly quickly.

But ironically, I think the biggest barrier to adoption will be the lack of an off-ramp to a FOSS solution that companies can self-host. Obviously Polars itself is FOSS, but it understandably seems like there's no way to self-host a backend to point a `pc.ComputeContext` at. That will be an especially tough sell for companies that are already on Spark. I wonder how much they'll focus on startups vs. trying to get bigger companies to switch, and whether they'll try a Spark compatibility layer like DataFusion Comet (https://github.com/apache/datafusion-comet).
This is very interesting; clearly there's a major pain point here to be addressed, especially the delta between local pandas work and distributed [PySpark] work!

Would love to test this out and run benchmarks against us, Dask, Spark, Ray, etc., which have been our primary testing ground. Full disclosure: I work at Bodo, which has similar-ish aspirations (https://github.com/bodo-ai/Bodo), but FOSS all the way.
As a hobbyist, I describe Polars as what pandas would be if it had been designed for humans to use. It's great to use; I just hate running into issues while trying to use it. I wish them luck.
Really excited for the Polars team. I've always been impressed by their work and responsiveness to issues I've filed in the past. The world is lifted when there is good competition like this.
How does this integrate with existing services like AWS Glue? I fear that despite Polars being as good or better, it will lack adoption since it cannot easily be integrated.
I just got into data analysis recently (former software engineer) and tried out pandas vs. Polars. I like Polars way more because it feels like SQL, but sane, and it's faster. It's clear about what it's trying to do. I never really had that feeling with pandas.
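A trivial example of what I mean (made-up data, but the API calls are real): the Polars version reads almost like the SQL you would have written.

    import polars as pl

    df = pl.DataFrame({
        "dept": ["eng", "eng", "sales"],
        "salary": [100, 120, 90],
    })

    # Roughly: SELECT dept, AVG(salary) AS avg_salary FROM df
    #          GROUP BY dept ORDER BY avg_salary DESC
    summary = (
        df.group_by("dept")
          .agg(pl.col("salary").mean().alias("avg_salary"))
          .sort("avg_salary", descending=True)
    )
    print(summary)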
I can appreciate the pain points you guys are addressing.

The "diagonal scaling" approach seems particularly clever - dynamically choosing between horizontal and vertical scaling based on the query characteristics rather than forcing users into a one-size-fits-all model. Most real-world data workloads have mixed requirements, so this flexibility could be a major advantage.

I'm curious how the new streaming engine with out-of-core processing will compare to Dask, which has been in this space for a while but hasn't quite achieved the adoption of pandas/PySpark despite its strengths.

The unified API approach also tackles a real issue. The cognitive overhead of switching between pandas for local work and PySpark for distributed work is higher than most people acknowledge. Having a consistent mental model regardless of scale would be a productivity boost.

Anyway, I would love to apply for early access and try it out. I'd be particularly interested in seeing benchmark comparisons against Ray, Dask, and Spark for different workload profiles. Also curious about the pricing model and the cold start problem that plagues many distributed systems.
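On the unified API point, a small sketch of the mental model I'm hoping for: the local half uses today's lazy API, while the remote dispatch lines are purely hypothetical and only illustrate the idea of sending the same plan to managed compute instead of rewriting it in PySpark.

    import polars as pl
    # import polars_cloud as pc  # hypothetical import, names below are illustrative only

    # One lazy query definition, regardless of where it eventually runs.
    query = (
        pl.scan_parquet("events/*.parquet")
          .filter(pl.col("country") == "NL")
          .group_by("user_id")
          .agg(pl.col("amount").sum().alias("total"))
    )

    # Local: collect on my laptop (the new streaming engine would let this
    # spill out-of-core for larger-than-memory data).
    local_result = query.collect()

    # Remote (assumption, not the real API): hand the identical plan to a cluster.
    # ctx = pc.ComputeContext(cpus=16, memory=64)
    # remote_result = query.remote(ctx).collect()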