I often find myself waiting hours (in one case, days) for an ML model to train or for a hyperparameter search to finish. I recently discovered Dask, and with small changes to my Python code I saved many hours of compute. I only wish I had known about it sooner.<p>Does anyone else spend hours waiting for their computations to run? If so, I'd love to:<p>Know which tasks are the most time-consuming for you (in terms of waiting for the CPU to finish)<p>Learn how you deal with large datasets when doing distributed computing<p>Feel free to comment below or message me if you'd like to share tips, links, etc. I feel like a noob for not knowing about it until now, and I'm afraid I might be missing other important use cases, tools, etc. Any and all feedback is appreciated!
The Julia programming language would help speed up computation.<p><a href="https://julialang.org/benchmarks/" rel="nofollow">https://julialang.org/benchmarks/</a><p>You can use Julia with Apache Spark, and Julia works with Python via PyCall. If you are working with tabular data, the Julia SparkSQL.jl package lets you create Spark apps using just Julia and SQL:<p><a href="https://github.com/propelledanalytics/SparkSQL.jl" rel="nofollow">https://github.com/propelledanalytics/SparkSQL.jl</a><p>Tutorials:<p><a href="https://propelledanalytics.github.io/Tutorials/" rel="nofollow">https://propelledanalytics.github.io/Tutorials/</a>