
Ask HN: Going beyond Pandas for analysis, how to stay sane?

4 points | by dwrodri | 3 months ago

I am an ML Engineer at a Python shop supporting a team of 15-20 data analysts/scientists with a wide range of experience. Most of my gig is building tooling for them and dogfooding that tooling to make sure it works well. All of our data people know SQL pretty well, but we'd rather not let people run wild writing data transformations in two paradigms (the Pandas API vs. SQL queries) if we don't have to.

Am I chasing a rainbow trying to provide a consistent DX here, given that this is effectively a solo project at this scale?

DuckDB seems like a promising contender, but a whole new generation of tooling has emerged to contend with the limitations of pandas.

Does anyone here have positive stories of adopting this tooling without effectively also signing up for a huge maintenance effort or an expensive new enterprise license? Open to any and all options that don't require my end users to move away from Python.

2 comments

ehsantn · 3 months ago

I think staying in Python is best for the team's productivity. There are a lot of tools that accelerate and scale Pandas code while trying to stay as compatible as possible. Look at "Pandas on Spark", Snowpark (if you're on Snowflake), Bodo, Dask, and Modin, for example.

Disclaimer: I work on Bodo and wanted to share it in case others find it useful. https://github.com/bodo-ai/Bodo
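The pitch most of these tools share is a drop-in pandas API: existing analyst code keeps working, only the import changes. A sketch of that pattern using Modin (the try/except fallback is just so the snippet runs even where Modin isn't installed; the data is illustrative):

```python
# Drop-in replacement pattern: same DataFrame API, swapped import.
try:
    import modin.pandas as pd  # assumption: Modin is available
except ImportError:
    import pandas as pd  # falls back to plain pandas, same code below

frame = pd.DataFrame({"user": ["x", "y", "x"], "spend": [10, 20, 5]})
per_user = frame.groupby("user")["spend"].sum()
```

How completely each tool covers the pandas API varies, so it's worth checking the compatibility notes for whichever one you evaluate.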
realityfactchex · 3 months ago

Don't reinvent the wheel. Don't gravitate toward shiny solutions to solved problems. Don't buy the hype train. Don't pay for what you don't need.

Just use Apache Airflow. Self-host it, or use the vanilla managed offering on whatever cloud you are on.

Plain, classic, vanilla Airflow pipelines are the embodiment of dead-simple Python concepts and syntax.

Yes, it is the boring solution from 10 years ago. Yes, it should work fine. Yes, the current version is nicer than the version from 10 years ago.

The hardest part for you may be writing down good norms for the users: suggested (required?) naming conventions, and maybe common modules they can import to implement the best practices you list for them.

The hardest part for the data scientists and analysts, at first, will be learning the one or two idiosyncrasies of the scheduling parameters. And that's not much more than cron syntax and dependency lists. Document those for the analysts and scientists. Suggest good naming conventions, and, if you want, provide a few reusable pieces of code in utility modules that people can import into their pipelines. Have a best-practices wiki page internally.

Store the pipeline code in SCM. Let them all search each other's pipelines for inspiration. Sync the repo to the DAG bag folder using a GitHub Action or whatever it is you use for that.

If that sounds like too much, let me offer one more idea to make this as simple as possible without you or the data analysts and data scientists going insane, to cement the proposition. Keep an "examples" directory in the DAG bag. In it, have one examples/thing_it_does pipeline for each common pattern needed at your org.

Then you and the data scientists and analysts (a) have the canonical example "proving" that each thing works, since it is possible to copy and run that pipeline as-is in minimally modified form, and (b) have a holy grail of copy-pasteable code that the data people can use to go to town (or at least refer to as the simple case, when a lot of the actual pipelines start to get bigger).

Then you're just about done. You could blow up the Airflow server and your databases, and if you still have the pipeline code, the pipelines were idempotent, and the sources still exist, they will just pick up and rebuild whatever got lost.

If Airflow pipelines are too hard for some data analysts and data scientists, then let those people stay in Pandas.

But for your data people who want to pipeline stuff in SCM in a sensible way, without adopting anything too shiny, I don't see how this could go wrong.

If some teams are extra crazy, give them their own instance, but really I would try not to do that. Just have enough scheduler and worker nodes available, I guess.

And don't worry about multiple Airflow environments. Gimme a break. Test and run it all in prod. Seriously. (This is where those naming/usage conventions come into play.)

You can connect to whatever data sources and targets you need to with current Airflow, I imagine.

There are many hosting options across the full range of how much help you do or don't need/want.

SQL is the lingua franca of data. Use that.
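The "cron syntax and dependency lists" claim can be made concrete with a minimal DAG file of the kind that might live in that examples directory. This is a sketch assuming Airflow 2.x's TaskFlow API; the DAG name, schedule, and task bodies are all illustrative, and it needs an Airflow install to actually run:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="examples_daily_rollup",   # illustrative name per the naming conventions
    schedule="0 6 * * *",             # plain cron: every day at 06:00
    start_date=datetime(2025, 1, 1),
    catchup=False,                    # don't backfill runs before deployment
)
def examples_daily_rollup():
    @task
    def extract() -> list[int]:
        # Stand-in for pulling rows from a source system.
        return [1, 2, 3]

    @task
    def load(rows: list[int]) -> None:
        # Stand-in for an idempotent write to the target.
        print(sum(rows))

    # The "dependency list": load runs after extract, fed by its output.
    load(extract())


examples_daily_rollup()
```

Dropping a file like this into the DAG bag is the whole deployment story, which is arguably the point of the "boring solution" argument above.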