Summary in the conclusion:<p>"The end result is an implementation several orders of magnitude faster than the current reference implementation in Java. ... [Python] makes the first version easy to implement and provides plenty of powerful tools for optimization later when you understand where and how you need it." [edited to be a statement instead of rhetorical question]
But you can have both. In Scala I can write prototypes just as rapidly as in Python, but I can run them with close-to-native performance. I can even explore interactively in a REPL backed by the power of my company's big computer cluster, using spark-shell. The profiling capabilities are excellent, and when I spot a bottleneck I can solve it in the language directly, without the awkwardness of Cython or of converting to/from NumPy formats.<p>(And while I personally love Scala, there's nothing magic about it in this regard. There's no reason a language can't offer Python-like expressiveness and Java-like performance, and many modern languages do.)
Large-scale data processing jobs normally arrange themselves into data acquisition/cleaning, grunt numerical work and result formatting/display. These tasks have very different requirements, so pairing a tool that can do all the data handling easily (e.g. Python) with a tool that can throw the CPU at a numerical problem (e.g. C) works as a great combination.<p>In contrast, if you work in Java, you are trying to use the same tool for both jobs, and you may well fall between two stools. And I say that as a typical Java-head.<p>My only question about the 2-tool combination is whether there are better combinations. Python has all the libraries and community support, so any alternative would need something similar. Maybe Node?<p>As for the number crunching, I think Rust would be a better choice here. Good memory management is its USP and that can have significant performance benefits.
This may be a liiiitle bit off-topic, but I really need to get it off my chest: Python for high-performance scientific computing works <i>beautifully</i>... it's a dream. Scipy/numpy, matplotlib, pandas, ipython. They're all unbelievably awesome. It all just works.<p><i>Except</i> when you're on Windows, and it just doesn't. Just installing things and doing the 'hello world' for the aforementioned libraries is laughably impossible.<p>So, use Python, but use it only on Linux.<p>(Okay, if you absolutely must do it on Windows: use Anaconda.)
> once I had a decent algorithm, I could turn to Cython to tighten up the bottlenecks and make it fast.<p>What are your preferred ways to profile Python code? Coming recently from PHP, where we have XDebug/KCachegrind, the excellent Facebook-sponsored Xhprof, <a href="https://blackfire.io" rel="nofollow">https://blackfire.io</a> and <a href="https://tideways.io" rel="nofollow">https://tideways.io</a>, Python's profiling tools have felt like a step backwards.<p>I've tried line_profiler, and used memory_profiler and cProfile with pyprof2calltree and KCachegrind. I've found the cProfile output confusing when it crosses the Python-C barrier for numpy, sklearn etc.
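For readers unfamiliar with the cProfile workflow mentioned above, here is a minimal sketch using only the standard library. The function `slow_sum_of_squares` and the helper `profile_call` are hypothetical names made up for illustration; they are not from the article or from any library. It only shows the pure-Python side and does not address the Python-C barrier problem.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    # Hypothetical workload: a deliberately naive pure-Python loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    # Hypothetical helper: run func under cProfile and return
    # its result together with a text report.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    # Collect the report in a string instead of printing to stdout.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show only the top 5 entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

Sorting by cumulative time is one common starting point; a line-level tool such as line_profiler can then be pointed at whichever function dominates the report.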
Why isn't Haskell, or any other functional language, popular for this sort of thing? Turning A into B is what FP excels at, and you shouldn't have to reason about side effects, besides writing the graph images somewhere.<p>From what a friend tells me about using other people's code in one particular scientific field (some things are stringly typed, probably by accident, and none of it is documented), an at-least-passable type system would be a huge improvement.
I have on my hands a pretty interesting BI project for a big company. So far, the proposal on the table has been .NET and SQL Server, but I am wondering if I should at least give Python a chance. Pandas is a great library, with great people working on it. Django the same. On the other hand, .NET has lots of professional (aka: with paid licenses) libraries that seem a better fit for an enterprise project. From a company perspective, the drawback Python has is, strangely, the lack of paid-for alternatives. It's not that people in companies don't trust open source (Hadoop is becoming big here too), but one wonders whether the developers will be able to find the support they need if any issue arises with a free library.
A very beginner Java programmer here. It's a nicely organized notebook and a great demo, but it seems like a lot of effort was put into optimizing the Python version, and none into the Java one. Isn't that an unfair comparison?<p>My real question is: is it so much easier to do this exercise in Python than in Java, assuming equal proficiency in either case?
Agreed. Python is an excellent tool in that respect. Batteries included helps. Being able to access fast C routines helps. Compile-to-C projects like Numba and Cython also help. And of course, IPython (Jupyter) notebooks for exploration.
Jake Vanderplas previously wrote an excellent blog post about Python performance and scientific computing: <a href="https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/" rel="nofollow">https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow...</a>
The interesting insight from the article is that Python might be a good language for learning algorithms. The fast development time allows you to write a complete (albeit clunky) program without the premature optimizing you might be tempted to do in other languages.
My current setup for "scientific computing" is RStudio and CSV files for quickly running typical stats tests (a couple of t-tests and a TOST + Krippendorff's alpha over the last couple of months) and Python+libraries for anything that resembles "building stuff" (mostly scikit-learn to build some classifiers).
I mostly use R "as a consumer" i.e. I basically use RStudio whenever my colleagues fire up SPSS.
That combination works fairly well. I'd recommend it to anyone who enters academia in any field that involves statistics who doesn't want to use the typical proprietary tools (I've also tried PSPP and it works ok for basic tasks but lacks a lot of functionality. If all you want to do is run a quick t-test or ANOVA it's a decent tool).
If Numpy, Pandas, etc. were wrappable from JavaScript this could have easily been titled "Why I use Node.js for High Performance Scientific Computing".<p>The "Python" here isn't particularly material to the result; it's mostly a wrapper around C. Toss in Cython, and now you've really gone outside the bounds of "I'm just using 'Python' for HPC!".<p>I agree some of the tooling and niceties are beyond a doubt best in breed with Python, but it's disingenuous to equate this to "writing HPC code in Python". If you had written an RPython to Verilog translator that produced an FPGA of your algorithm, would you call that "using Python"?