Why I Still Use Python for High Performance Scientific Computing

273 points by subnaught over 9 years ago

17 comments

msellout over 9 years ago

Summary in the conclusion: "The end result is an implementation several orders of magnitude faster than the current reference implementation in Java. ... [Python] makes the first version easy to implement and provides plenty of powerful tools for optimization later when you understand where and how you need it." [edited to be a statement instead of a rhetorical question]

lmm over 9 years ago

But you can have both. In Scala I can write prototypes just as rapidly as in Python, but I can run them with close-to-native performance. I can even explore interactively in a REPL backed by the power of my company's big computer cluster, using spark-shell. The profiling capabilities are excellent, and when I spot a bottleneck I can solve it in the language directly, without needing the awkwardness of Cython or of converting to/from numpy formats.

(And while I personally love Scala, there's nothing magic about it in this regard. There's no reason a language can't offer Python-like expressiveness and Java-like performance, and many modern languages do.)

kitd over 9 years ago

Large-scale data processing jobs normally arrange themselves into data acquisition/cleaning, grunt numerical work and result formatting/display. These tasks have very different requirements, so a tool that can do all the data handling easily (e.g. Python) plus a tool that can throw the CPU at a numerical problem (e.g. C) will work as a great combination.

In contrast, if you work in Java, you are trying to use the same tool for both jobs, and you may well fall between two stools. And I say that as a typical Java-head.

My only question about the 2-tool combination is whether there are better combinations. Python has all the libraries and community support, so any alternative would need similar. Maybe Node?

As for the number crunching, I think Rust would be a better choice here. Good memory management is its USP and that can have significant performance benefits.

pen2l over 9 years ago

This may be a liiiitle bit off-topic, but I really need to get it off my chest: Python for high-performance scientific computing works beautifully... it's a dream. Scipy/numpy, matplotlib, pandas, ipython. They're all unbelievably awesome. It all just works.

Except when you're on Windows, and it just doesn't. Just installing things and doing the 'hello world' for the aforementioned libraries is laughably impossible.

So, use Python, but use it only on Linux.

(Okay, if you absolutely must do it on Windows: use Anaconda.)

pbowyer over 9 years ago

> once I had a decent algorithm, I could turn to Cython to tighten up the bottlenecks and make it fast.

What are your preferred ways to profile Python code? Coming recently from PHP, where we have XDebug/KCachegrind, the excellent Facebook-sponsored Xhprof, https://blackfire.io and https://tideways.io, it's felt like a step backwards.

I've tried line_profiler, and used memory_profiler and cProfile with pyprof2calltree and KCachegrind. I've found the cProfile output confusing when it crosses the Python-C barrier for numpy, sklearn etc.
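
For readers unfamiliar with the cProfile workflow mentioned above, here is a minimal sketch using only the standard library. The functions `slow_sum_of_squares`, `workload` and `profile_workload` are hypothetical names invented for illustration; they are not from the article or from any library. It only shows the basic collect-then-sort pattern, not the pyprof2calltree/KCachegrind step.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive pure-Python loop, so the profiler has a hot spot to report."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Hypothetical workload: call the hot function a few times."""
    return [slow_sum_of_squares(50_000) for _ in range(5)]


def profile_workload():
    """Profile workload() and return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = workload()
    profiler.disable()

    # Write the report into a string instead of stdout so it can be inspected or saved.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # show at most 5 entries
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_workload()
    print(report)
```

Sorting by cumulative time tends to surface the call chain leading to a bottleneck; calls that disappear into compiled extension code usually show up as a single opaque entry, which may be part of the confusion described above.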

cballard over 9 years ago

Why isn't Haskell, or any other functional language, popular for this sort of thing? Turning A into B is what FP excels at, and you shouldn't have to reason about side effects, besides writing the graph images somewhere.

From what I've heard from a friend about using other people's code in one particular scientific field (stringly type some of the things, probably accidentally, don't document this), an at-least-passable type system would be a huge improvement.

kfk over 9 years ago

I have in my hands a pretty interesting BI project for a big company. So far, the proposal on the table has been .NET and SQL Server, but I am wondering if I should at least try to give Python a chance. Pandas is a great library, with great people working on it. Django the same. On the other hand, .NET has lots of professional (aka: with paid licenses) libraries that seem a better fit for an enterprise project. From a company perspective, the drawback Python has is, strangely, the lack of paid-for alternatives. It's not that people in companies don't trust open source (Hadoop is becoming big here too), but one wonders if the developers will be able to find the support they need in case any issue arises from a free library.

banku_brougham over 9 years ago

A very beginner Java programmer here. It's a nicely organized notebook and a great demo, but it seems like a lot of effort was put into optimizing the Python version, and none into the Java one. Isn't that an unfair comparison?

My real question is: is it so much easier to do this exercise in Python than in Java, assuming equal proficiency in either case?

rdtsc over 9 years ago

Agreed. Python is an excellent tool in that respect. Batteries included helps. Being able to access fast C routines helps. Compile-to-C projects like Numba and Cython also help. And of course, ipython (Jupyter) notebooks for exploration.
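
As a rough illustration of the "fast C routines" point, the sketch below times a hand-written Python loop against the built-in `sum`, which is implemented in C in CPython. The helper names `python_loop_sum` and `compare` are made up for this example, and the actual speed ratio will vary by machine and interpreter, so no numbers are claimed here.

```python
import timeit


def python_loop_sum(values):
    """Add up values with an explicit interpreted loop."""
    total = 0
    for v in values:
        total += v
    return total


def compare(n=100_000, repeats=5):
    """Return (loop_seconds, builtin_seconds), each the best of `repeats` runs of 10 calls."""
    values = list(range(n))

    # Both approaches must agree before timing them is meaningful.
    assert python_loop_sum(values) == sum(values)

    loop_time = min(
        timeit.repeat(lambda: python_loop_sum(values), number=10, repeat=repeats)
    )
    builtin_time = min(
        timeit.repeat(lambda: sum(values), number=10, repeat=repeats)
    )
    return loop_time, builtin_time


if __name__ == "__main__":
    loop_time, builtin_time = compare()
    print(f"python loop: {loop_time:.4f}s  built-in sum: {builtin_time:.4f}s")
```

The same idea, pushing the inner loop out of the interpreter, is what NumPy, Cython and Numba take much further.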

cosmoharrigan over 9 years ago

Jake Vanderplas previously wrote an excellent blog post about Python performance and scientific computing: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/

daemonk over 9 years ago

The interesting insight from the article is that Python might be a good language for learning algorithms. The fast development time allows you to write a complete (albeit clunky) program without the premature optimization you might be tempted to do in other languages.

kriro over 9 years ago

My current setup for "scientific computing" is RStudio and CSV files for quickly running typical stats tests (a couple of t-tests and a TOST plus Krippendorff's alpha over the last couple of months) and Python + libraries for anything that resembles "building stuff" (mostly scikit-learn to build some classifiers). I mostly use R "as a consumer", i.e. I basically use RStudio whenever my colleagues fire up SPSS. That combination works fairly well. I'd recommend it to anyone entering academia in any field that involves statistics who doesn't want to use the typical proprietary tools. (I've also tried PSPP and it works OK for basic tasks but lacks a lot of functionality. If all you want to do is run a quick t-test or ANOVA, it's a decent tool.)

hcrisp over 9 years ago

Good article. Couldn't find who wrote it since it doesn't have a byline. I'm guessing it was Leland McInnes?

SeanDav over 9 years ago

Couple of questions:

- Could this be/was this developed in Python 3.x?

- What is this "notebook" he keeps on referring to?

buildops over 9 years ago

Absolutely, and it is even easier if you use Ceemple for your IDE.

bipin_nag over 9 years ago

I use Spark. Will using Python help a lot?

boulos over 9 years ago

If Numpy, Pandas, etc. were wrappable from JavaScript, this could have easily been titled "Why I use Node.js for High Performance Scientific Computing".

The "Python" here isn't particularly material to the result; it's mostly a wrapper around C. Toss in Cython, and now you've really gone outside the bounds of "I'm just using 'Python' for HPC!".

I agree some of the tooling and niceties are beyond a doubt best in breed with Python, but it's disingenuous to equate this to "writing HPC code in Python". If you had written an RPython to Verilog translator that produced an FPGA of your algorithm, would you call that "using Python"?
