I am contracted by a major financial services firm to refactor an analytical model used for revenue forecasting into PySpark running on an AWS EMR cluster.<p>The project's current status is documented[0].<p>The client's team responsible for operationalization was successful in refactoring another analytical model into Python/pandas. The current model execution time for a 5-year scenario is ~17 hours. Most of that time is
spent executing poorly crafted Oracle SQL queries that draw millions of rows into the analytical run-time for sorting, aggregation, discarding, merging, and splitting tasks.<p>To constrain this execution time, the final input is a sample of ~1.8M rows from a loan portfolio of ~81M records.<p>The client is concerned about performance and believes PySpark is the preferred target language.<p>I have been on this project for just one month, but I previously contracted at the same firm for six months to refactor another model into Python/pandas. That project was successful, mainly due to the team leader's rigor in meeting milestones and ability to remove blockers for the team.<p>I recently discussed these projects with @Travis Oliphant, who had some interesting ideas on Python-based frameworks for overcoming issues with processing out-of-core dataframes. We discussed the frameworks Dask[1]; Coiled.io, commercial Dask support[2]; Ray[3]; and Modin, commercial support for Ray[4].<p>Others discussed were Databricks[5], bodo.ai[6], Voltron Data[7], and AtScale[8]. On Reddit, the
commentary on Snowflake was very positive[9].<p>Easing the maintenance burden of keeping the model in production and devising new scenarios (e.g. Covid-19 effects on forbearance requests) are requirements. The model's shelf life is measured in years, making maintainability a major consideration.<p>What have others experienced when scaling out feature engineering tasks for teams familiar with Python/pandas?<p>Is PySpark a dead-end library in the Python ecosystem?<p>[0] https://www.pythonforsasusers.com/project_summary/current_project_status.html<p>[1] https://dask.org/<p>[2] https://coiled.io/<p>[3] https://docs.ray.io/en/ray-0.4.0/pandas_on_ray.html<p>[4] https://modin.readthedocs.io/en/stable/<p>[5] https://docs.databricks.com/languages/pandas-spark.html (which points to Apache's Pandas API on Spark)<p>[6] https://bodo.ai/<p>[7] https://wesmckinney.com/blog/from-ursa-to-voltrondata/<p>[8] https://www.atscale.com/autonomous-data-engineering/<p>[9] https://www.reddit.com/r/dataengineering/comments/r893rw/why_is_snowflake_so_popular/