
Ask HN: Going beyond Pandas for analysis, how to stay sane?

4 points | by dwrodri | 3 months ago

I am an ML Engineer at a Python shop supporting a team of 15-20 data analysts/scientists with a wide range of experience. Most of my gig is building tooling for them and dogfooding that tooling to make sure it works well. All of our data people know SQL pretty well, but we'd rather not let people run wild writing data transformations in two paradigms (the Pandas API vs. SQL queries) if we don't have to.

Am I chasing a rainbow trying to provide a consistent DX here, given that this is effectively a solo project at this scale?

DuckDB seems like a promising contender, but a whole new generation of tooling has emerged to contend with the limitations of pandas.

Does anyone here have positive stories of adopting this tooling without effectively also signing up for a huge maintenance effort or an expensive new enterprise license? Open to any and all options that don't require my end users to move away from Python.

2 comments

ehsantn · 3 months ago

I think staying in Python is best for the team's productivity. There are a lot of tools that accelerate and scale Pandas code while trying to stay as compatible as possible. Look at "Pandas on Spark", Snowpark (if you're on Snowflake), Bodo, Dask, and Modin, for example.

Disclaimer: I work on Bodo and wanted to share it in case others find it useful. https://github.com/bodo-ai/Bodo
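The pitch most of these tools share is a drop-in pandas API: existing analyst code keeps working, only the import changes. A sketch of that pattern using Modin (the try/except fallback is just so the snippet runs even where Modin isn't installed; the data is illustrative):

```python
# Drop-in replacement pattern: same DataFrame API, swapped import.
try:
    import modin.pandas as pd  # assumption: Modin is available
except ImportError:
    import pandas as pd  # falls back to plain pandas, same code below

frame = pd.DataFrame({"user": ["x", "y", "x"], "spend": [10, 20, 5]})
per_user = frame.groupby("user")["spend"].sum()
```

How completely each tool covers the pandas API varies, so it's worth checking the compatibility notes for whichever one you evaluate.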
realityfactchex · 3 months ago

Don't reinvent the wheel. Don't gravitate toward shiny solutions to solved problems. Don't buy the hype train. Don't pay for what you don't need.

Just use Apache Airflow. Self-host it, or use the vanilla managed offering on whatever cloud you are on.

Plain, classic, vanilla Airflow pipelines are the embodiment of dead-simple Python concepts and syntax.

Yes, it is the boring solution from 10 years ago. Yes, it should work fine. Yes, the current version is nicer than the version from 10 years ago.

The hardest part for you may be writing down good norms for the users: suggested (required?) naming conventions, and maybe common modules they can import to implement the best practices you list for them.

The hardest part for the data scientists and analysts, at first, will be learning the one or two idiosyncrasies of the scheduling parameters. And that's not much more than cron syntax and dependency lists. Document those for the analysts and scientists. Suggest good naming conventions, and, if you want, provide a few reusable pieces of code in utility modules that people can import into their pipelines. Have a best-practices wiki page internally.

Store the pipeline code in SCM. Let them all search each other's pipelines for inspiration. Sync the repo to the DAG bag folder using a GitHub Action or whatever it is you use for that.

If that sounds like too much, let me offer one more idea to make this as simple as possible without you or the data analysts and data scientists going insane, to cement the proposition. Keep an "examples" directory in the DAG bag. In it, have one examples/thing_it_does pipeline for each common pattern needed at your org.

Then you and the data scientists and analysts (a) have the canonical example "proving" that each thing works, since it is possible to copy and run that pipeline as-is in minimally modified form, and (b) have a holy grail of copy-pasteable code that the data people can use to go to town (or at least refer to as the simple case, when a lot of the actual pipelines start to get bigger).

Then you're just about done. You could blow up the Airflow server and your databases, and if you still have the pipeline code, the pipelines were idempotent, and the sources still exist, they will just pick up and rebuild whatever got lost.

If Airflow pipelines are too hard for some data analysts and data scientists, then let those people stay in Pandas.

But for your data people who want to pipeline stuff in SCM in a sensible way, without adopting anything too shiny, I don't see how this could go wrong.

If some teams are extra crazy, give them their own instance, but really I would try not to do that. Just have enough scheduler and worker nodes available, I guess.

And don't worry about multiple Airflow environments. Gimme a break. Test and run it all in prod. Seriously. (This is where those naming/usage conventions come into play.)

You can connect to whatever data sources and targets you need to with current Airflow, I imagine.

There are many hosting options across the full range of how much help you do or don't need/want.

SQL is the lingua franca of data. Use that.
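The "cron syntax and dependency lists" claim can be made concrete with a minimal DAG file of the kind that might live in that examples directory. This is a sketch assuming Airflow 2.x's TaskFlow API; the DAG name, schedule, and task bodies are all illustrative, and it needs an Airflow install to actually run:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="examples_daily_rollup",   # illustrative name per the naming conventions
    schedule="0 6 * * *",             # plain cron: every day at 06:00
    start_date=datetime(2025, 1, 1),
    catchup=False,                    # don't backfill runs before deployment
)
def examples_daily_rollup():
    @task
    def extract() -> list[int]:
        # Stand-in for pulling rows from a source system.
        return [1, 2, 3]

    @task
    def load(rows: list[int]) -> None:
        # Stand-in for an idempotent write to the target.
        print(sum(rows))

    # The "dependency list": load runs after extract, fed by its output.
    load(extract())


examples_daily_rollup()
```

Dropping a file like this into the DAG bag is the whole deployment story, which is arguably the point of the "boring solution" argument above.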