R, the master troll of statistical languages (2012)

184 points, posted by Goldenromeo over 9 years ago

29 comments

mziel, over 9 years ago

The problem is people using R without trying to learn about the language itself, just assuming it works like their favourite language.

For example, complaining that R is slow and then writing an iterative solution instead of using vectorization. When I saw the example the author gave, my first thought was "sapply/lapply". lapply is essential to using R, and is taught early on in every book or course on R I've ever seen.

> In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do.
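
For readers who share the quoted confusion, here is a minimal sketch of how three of those functions differ in what they return. The list `xs` is made-up data, and this covers only the simplest case of each function.

```r
# Three small numeric vectors of different lengths (made-up data).
xs <- list(a = 1:3, b = 4:6, c = 7:10)

# lapply() always returns a list, one element per input element.
as_list <- lapply(xs, mean)

# sapply() tries to simplify the result; here it becomes a named numeric vector.
as_vector <- sapply(xs, mean)

# vapply() does the same but checks each result against a declared template,
# here "a single number".
checked <- vapply(xs, mean, numeric(1))
```

The explicit template in `vapply()` is what makes it the safer choice inside functions: a result of an unexpected shape raises an error instead of silently changing the return type.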

wanderfowl, over 9 years ago

I'm a postdoc in a small social sciences department in a major university, and am probably the department's ranking R geek. I did my dissertation, and much of my current work, doing modeling, analysis, and even machine learning in R.

In many ways, I owe much of my success to the power that R has allowed me to wield. Multicore lapplys and ggplot2 are my life these days. But even with this, R drives me absolutely batty, and the documentation, even battier.

I may be competent relative to most, but R feels so taped-together and idiosyncratic that even on my best days, I just feel like a newbie who's built up an army of ugly hacks.

Someday, I'll learn more about the Python stats tools and do my stats there. But for now, R it is. Troll on, you crazy bastard.

louden, over 9 years ago

R is a language with a lot of gotchas. I usually get burned by characters being converted to factors in read.csv() and by converting factors to numeric (it works, but not how you intend). The R Inferno (http://www.burns-stat.com/documents/books/the-r-inferno/) has a lot of other gotchas and is worth a read for people who use the language.

That said, the power, flexibility and user community make it my go-to for any first crack at an analysis of data.
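
A minimal sketch of the factor-to-numeric gotcha described above, using made-up data. The inline CSV text and column names are illustrative only.

```r
# A factor whose labels look like numbers (made-up data).
f <- factor(c("10", "20", "10"))

# as.numeric() returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Converting through character recovers the values the labels spell out.
values <- as.numeric(as.character(f))

# Reading text columns as plain character avoids the factor step entirely.
df <- read.csv(text = "id,score\na,10\nb,20", stringsAsFactors = FALSE)
```

Here `codes` holds 1, 2, 1 while `values` holds 10, 20, 10. Since R 4.0.0 the default for `stringsAsFactors` is `FALSE`, so the read.csv() half of the problem mostly affects older installations and older scripts.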

nathell, over 9 years ago

I have a love-hate relationship with R, being a predominantly Clojure (and Ruby these days) programmer who only occasionally dabbles in data crunching.

The apply/sapply dichotomy that the article mentions (actually a hexachotomy, there are also lapply, mapply, tapply and vapply) is one example of a gazillion warts that the language has.

Another random one: R has a useful function, paste, that concatenates strings together. Only it takes varargs, not a character vector, so if you have a vector v of strings, you have to use do.call(paste, v). Only not, because do.call insists that its second argument be a list, not a vector, so you do do.call(paste, as.list(v)). And if you want to separate the strings, say, by commas, you have to affix the named argument sep, obtaining do.call(paste, c(as.list(v), sep=",")).

And R's three mutually incompatible object systems. And so on and so on and so on.

There are things to love. The packaging system works really well. I like the focus that R puts on documentation: hardly anywhere is it so comprehensive, with vignettes and all. There are things plainly inspired by Lisp (R is just about the only non-Lisp I know that has a condition and restart system akin to CL). And ggplot2 is one hell of a gem of an API.

In many ways, R is the PHP of data science. (Though the core language is still nowhere near as abysmal as PHP.) Despite all the warts, there are all sorts of statistical analyses that are just an install.packages() away. Put another way, R is to data science what LaTeX is to typesetting. It's a heavy pile of duct tape, but it's here to stay because it's just so damn useful.
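
For the paste case described above, a short sketch comparing the do.call route with paste()'s own `collapse` argument, which joins the elements of a single vector. The vector `v` is made-up data.

```r
# A character vector of made-up strings.
v <- c("alpha", "beta", "gamma")

# The do.call route from the comment: each element becomes a separate argument.
via_do_call <- do.call(paste, c(as.list(v), sep = ","))

# paste() also accepts the vector directly when `collapse` is given.
via_collapse <- paste(v, collapse = ",")
```

Both produce the same single string. The split between `sep` (between arguments) and `collapse` (between elements of the result) is arguably its own wart, but it does remove the need for do.call here.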

capnrefsmmat, over 9 years ago

My biggest complaint about R isn't the inconsistency and obtuseness -- I've been using it long enough to get familiar with the documentation and the zillions of varieties of apply. My problem is the data structures.

R has only a few core data structures: vectors, lists, arrays, and matrices. Data frames are built on top of lists, and admittedly data frames are incredibly useful for statistics -- there's a reason pandas exists, and a reason data analysis is much more tedious in other languages.

But there are no hash maps or sets (lists have named elements, but with O(n) indexing; the only hash tables available use environments and accept limited types of keys), no tuples, no structural or record types, stacks and queues only recently became available on CRAN (through C), and so on.

This leads to the folk belief that the only way to optimize R is to vectorize code or to write it in C or C++ (with Rcpp, for instance). No statistical programmer ever thinks about choosing the right data structure for the job, since you basically only ever use lists and data frames. Fast operations on data structures (like graph algorithms) have to be written in C. There's just no way to do it in R.

When I co-taught a statistical computing course, covering the basics of data structures and algorithms, I included some homework assignments where the difference between a fast and a slow algorithm was the choice of data structure. R users struggled because they had very little available to them. If their code wasn't fast because they were doing O(n) list lookups in a loop, there wasn't anything they could do to fix it.

I hope Python and Julia can eat R's lunch. Some day I'll have to get around to trying Julia for a serious project...
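
A minimal sketch of the environment-as-hash-table workaround the comment mentions, here counting word occurrences. The variable names and data are made up, and the limitation noted above still holds: keys have to be strings.

```r
# An environment used as a mutable, string-keyed map.
counts <- new.env(hash = TRUE)

words <- c("a", "b", "a", "c", "a")
for (w in words) {
  # exists() with inherits = FALSE looks only in this environment,
  # not in its parents.
  if (exists(w, envir = counts, inherits = FALSE)) {
    assign(w, get(w, envir = counts) + 1, envir = counts)
  } else {
    assign(w, 1, envir = counts)
  }
}

n_a <- get("a", envir = counts)   # count stored under key "a"
keys <- sort(ls(counts))          # all keys currently in the map
```

Unlike a named list, the environment is modified in place, so updating it inside a loop does not copy the whole structure on each assignment.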

noelwelsh, over 9 years ago

I think the general consensus is that R is a terrible language with a lot of useful libraries. I especially like that R claims to be inspired by Scheme, but the memo seems to have been "Make sure we f*ck this all up" taped to the front of the "Lambda the Ultimate" papers [1]. In particular, lexical scoping was one of the key innovations in Scheme and R has pervasively buggered up their implementation, from not distinguishing between defining and mutating a variable to making the default save/load procedures mutate the environment. OMG does R drive me insane (as a programming language person).

[Saying "there is a package on CRAN that fixes this" is not a solution. A language shouldn't require extensive knowledge of the ecosystem to get the basics working properly.]

[1] Scheme was introduced to the world in the "Lambda the Ultimate" series of papers. See http://library.readscheme.org/page1.html

tmalsburg2, over 9 years ago

Writing a variant of this article has become a rite of passage for all serious users of R. There are two issues that contribute to the difficulties people experience with R. First, yes, R can be confusing at times. Tal explains this really well, but only scratches the surface. There is so much more confusing and counter-intuitive stuff, for example with regard to factors, that only very few people seem to understand fully.

However, there is a second issue, and this is less often acknowledged: people expect R to be immensely powerful and at the same time easy to use, which is really not a very reasonable thing to expect. This attitude is fairly specific to the R community. No C++ developer would dare to write a long rant about the shortcomings of C++ while at the same time nonchalantly admitting that they never made a serious attempt to learn it. One symptom of this problem is that hardly any self-proclaimed R hacker has read Matloff's book "The Art of R Programming", which was for a long time, and perhaps still is, the only book on R programming. The mere fact that there is (or was) only one such book speaks volumes.

justin_oaks, over 9 years ago

Way too many languages are trolls, or have troll features. Or in other words, too many languages have features that don't do what a reasonable person would expect them to do.

I've long considered implicit type conversion to be a troll feature, especially how JavaScript does it. Another one is how differently Java treats primitives and Object types. Oracle databases treat nulls and empty strings the same.

At times like this, all I can do is lament and search in vain for a language with no troll features.

chollida1, over 9 years ago

I think this article nails exactly what's right and wrong with R.

This in particular sums up the learning curve of R.

> Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all.

Warning, personal opinion ahead...

R, the language, can get you up and running a lot faster than other languages for statistics, like, say, Python with pandas or SciPy, but even people who use it on a daily basis will curse the language's "quirks". I find most of the confusion comes from R trying to be too friendly to the user via type conversions. The ease with which R's type system will convert values has probably caused me more grief when first learning the language than any other issue I ran into.

And this illustrates the downside of using R:

> library(Hmisc)
> apply(ice.cream, 2, all.is.numeric)

> …which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution.

Since R is rarely a programmer's most used language, I find there tends to be an above-average use of google-and-paste type code that pulls in 50 different packages, each of which is used on 1-2 lines of a 1000-line script. Perhaps this is just a function of most programmers not really understanding the mathematical domain, and hence they slowly google and iterate their way towards a solution.

Often I'll see people pull in 5 different time series libraries just because each of them operates on a ts object, so they all can work on the same object, and each one provides one additional method the others don't that the programmer needs to create their solution.

You'll hear people talk about writing R in the Hadley universe or the basic R universe, but there isn't much talk about what a canonical R solution looks like. R is a great language in the sense that Perl and C++ are. It allows you to do anything, but there often isn't an agreed-upon way of writing it, and two different programmers can come up with wildly different but valid solutions to the same problem.

evandev, over 9 years ago

My thoughts (a bit of a rant) on R, as the lead engineer at a data-science-focused company, are that R is a great statistical language, but a poor programming language. By programming language I mean a language that is versatile across a variety of needs (web app, command-line app), such as Python, Ruby, etc. R can be used as a programming language the way a climbing rope can be used as a belt. It can, but shouldn't, because of some points I have below.

It is great for exploratory analysis, as it is forgiving and easy to use in the console for testing things; but once it needs to be put into practice, it has issues. For a non-programmer, grasping R isn't too hard thanks to some great developers in the community.

There is a lot of good in the R community, but people are focused on making it something it isn't. Just look at deploying R into production; that can be a nightmare. I've spent days looking over code to figure out where an error in production lies. One of the errors was a package of a package which was updated for the first time in years. That package depended on another package which my package called another function that called the first one; basically it was a mess of dependencies. And there are some misconceptions: while doing the engineering work in R and learning, I learned not to use for loops. Then one day I timed it and the for loop was 10x+ faster than any apply/plyr function, including using a GPU.

The thing that separates a programming language from a statistical language is that a programming language has more than one of these:

* Good dependency management
* Easy deployment into a production environment
* A clear way to set up an environment (e.g. naming, folder conventions)
* Ability to do most of the things you want with the base packages
* Good documentation about the above.

Basically, I believe a good data scientist is someone who can use R (or something else) to explore data and then create the algorithm in a compiled language to be put in production. And for someone who just needs to create analysis for research or a paper, R is the perfect use case. R is an excellent language for its use cases, just don't think about using it for general programming. It has caused a lot of extra dev hours working on issues with it.

Little plug, we wrote a piece on hiring data scientists. [0]

[0]: https://gastrograph.com/blogs/gastronexus/interviewing-data-science-interns.html

th0ma5, over 9 years ago

My own personal rant: I think the specific feeling I get is that the conceptual idea of R has long since outpaced the reality of R.

People like to fetishize data, and R sure lets you do that. The data science landscape, however, is growing such that R is really just a one-trick pony; that one trick, for better or worse, is being the gold standard of statistics and modeling, somehow.

But everything else wants to sugar-coat the software surrounding the statistics, and leaves you no room to grow.

This is a very bad, over-simplified example, but you sort of can't learn much about graphic design or good communication skills by using ggplot2 ... you can make something look very very nice, hopefully, in the general case, sure. And you can definitely do all kinds of hacks and crazy code to make it do whatever you want, but by doing that you produce ever more fragile and environment-dependent code. You'd be better off learning just about anything else for graphics (straight SVG, D3, Processing, Cairo directly, etc.) because it is of course a bit more of a problem starting up, but a generalized skill set that could allow you to grow.

You also learn pretty much nothing about web development from Shiny. Shiny is a wonderful idea, but ultimately prevents a statistician from implementing what it promises, which is an analytic application. At some point, you have to ditch it and learn more traditional web stacks. It is also something of a sales funnel into a server solution that's a DDoS or security nightmare just waiting to happen.

So instead of just griping, I guess I have some ideas... it would be nice to have a Ruby/JS/Java/Python service generator. It would be nice to have a D3/React/whatever-based generator. It would be nice for there to be a data munging solution (or even whole models, more like PMML-type stuff) that can be generalized into something that could be compiled or that generates Python/Java/Bash/JS/whatever code.

Ultimately you start thinking along those lines, and you realize that the promises R is making about empowering the analyst are just teasing them rather than helping.

R could do with less magic and more concentration on being simply a great statistics engine that integrates better. I guess it is that to some degree, but it sure fails the rest of the technology world that tries to live with it.

DangerousPie, over 9 years ago

I have my fair share of problems with R, but that first example (4 ways to select a column) seems a bit silly. Just off the top of my head, I could think of plenty of ways to do the same thing in Python/pandas:

<pre><code>
# As of the pandas of the time; icol and .ix were removed in later releases.
ice_cream.icol(0)
ice_cream['col']
ice_cream.iloc[:, 0]
ice_cream.loc[:, 'col']
ice_cream.ix[:, 'col']
</code></pre>

And if you wanted to make things more convoluted, you could also wrap things into lists like the author did in the R example. So this is definitely not a problem that is unique to R; any reasonably flexible language has it.

Gatsky, over 9 years ago

Ok, but this seems pretty trivial compared to the many exclusive advantages R has. I've had minimal problems using and extending other people's software packages written in R (for bioinformatics). This has definitely not been the case with Java, R or Perl, where just installing said software package is often unusually painful or impossible.

I think R is a prime example of how useful a domain-specific language can be. As such, I see Julia as the most viable replacement, although that will take a long, long time.

stevetrewick, over 9 years ago

So, as a code person rather than a stats one, my first reaction was that in the first example there is in fact only a single way to access a column but multiple ways to specify which one, all of which made immediate intuitive sense to me.

So I wonder if this is less about R specifically and more a feature of people approaching a language (any language) without that code geek intuition for the underlying affordances?

superuser2, over 9 years ago

Most of my university classmates' first exposure to programming is using R in a statistics class. It's awful. I wish they'd make Python or something a prerequisite, so that giant swaths of people don't get turned off of computing or start with the strange ideas it teaches.

pak, over 9 years ago

There are certain languages that are good for a first-time programmer.

R, despite being one of the first languages a budding "data scientist" might want to use, is probably not one of them for the many reasons given, among them:

- there are way too many ways to do everything
- implicit iteration (although great for statistics) makes performance issues hard to spot
- the data structures are a bit too flexible (it is Lisp-y in places), and you really need to understand them all to deploy the *apply and plyr functions effectively
- 3+ object-oriented programming systems
- non-standard evaluation. It's all over popular libraries like ggplot2, because it increases terseness, but it just looks like magic to beginners.

Basically, all the chapters listed here [0] -- which happens to be a great guide for experienced programmers to really understand R as a language -- happen to be the same reasons beginners give up too quickly.

Python, although it sufficiently nags me with its one-way-to-do-it motto and its many warts [1] to not want to use it regularly, is just well-rounded enough that it is a much better language for beginners. With Anaconda and IPython installed, I've found that a total programming beginner can actually get productive pretty quickly, even on stats and math problems.

[0]: http://adv-r.had.co.nz/

[1]: https://wiki.python.org/moin/PythonWarts

mikeskim, over 9 years ago

This could change your life: http://adv-r.had.co.nz/

dmlorenzetti, over 9 years ago

> The upshot is that unless you carefully read the apply() documentation..., you’re hosed.

One thing that jumps out at me, having returned to R after several years in the Python world, is how obtuse its documentation can be.

The standard format for R documentation does a few things that I find impede understanding. First, the help pages are organized into sections giving the high-level description, the arguments, the details, and the results ("values"). The "details" generally are organized by argument keyword, and the arguments section draws on the language laid down -- usually in a vague, high-level way -- by the description section. Finally, the practical effects of the details are deferred till the results section. That means unless you already know what's going on, you end up having to jump around among sections, trying to synthesize everything.

This is particularly a problem for those help pages -- and there are a lot of them -- that describe a raft of related functions all at the same time. Describing a bunch of related functions in the same place sounds like a good idea (it should help you figure out `apply` vs `sapply`, right?). Yet this is exactly when the documentation organization results in the most scattershot reading, because in addition to having to synthesize between sections, you have to mentally prune away text that, for one reason or another, doesn't apply to your particular case (for example, because different functions don't all share the same arguments, or because you want to read about the values for just one variation on the function).

Another idiom I dislike in the standard R documentation is how the examples don't actually show any sample output. There are generally some attempts at comments to explain what the sample code should or shouldn't do, but they are very much written in the style of programmer's comments, not in the style of documentation or learning points. So you end up having to run the code, and sometimes puzzle over the results for a while.

Here's an example, from the help page that I happen to have open right now, `help(sample)`:

<pre><code>
# sample()'s surprise -- example
x <- 1:10
sample(x[x > 8]) # length 2
sample(x[x > 9]) # oops -- length 10!
sample(x[x > 10]) # length 0
</code></pre>

The comments alert me that there's a "surprise" in store, and they even allude to the (apparently surprising) fact that the second line produces a 10-vector. Notably lacking is any explanation of what's meant to be surprising here, how that relates to the internal logic of `sample`, or how to avoid falling into the trap.

Overall, I feel like R's documentation is a bit like a conversation among experts, with a rather sink-or-swim attitude towards newcomers.

Documentation is far from the first thing that stands out about R vs Python, but it's the most salient, I think, in the context of the original article.
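
As a sketch of what the "surprise" in that help-page example is: when `sample()` is given a single number of at least 1, it treats it as the upper end of a range and samples from 1 up to that number, instead of returning the one element. The `resample` helper below follows the workaround given on the same help page; `one_value`, `surprise`, and `safe` are illustrative names.

```r
x <- 1:10
one_value <- x[x > 9]   # a vector holding the single value 10

# A length-one numeric argument is read as "sample from 1:10".
set.seed(1)
surprise <- sample(one_value)

# Indexing with sample.int() always permutes the elements themselves,
# whatever the length of the input.
resample <- function(x, ...) x[sample.int(length(x), ...)]
safe <- resample(one_value)
```

`surprise` has ten elements, while `safe` is just the original value. The trap matters mostly in code where the length of the filtered vector depends on the data.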

elcapitan, over 9 years ago

Not intending to start a language war here, but if somebody who has experience with both R and Python/pandas/etc could answer - how's the current state of the emerging Python data/statistics ecosystem compared to R? (not counting all the other differences like R being allegedly weird or Python more general purpose and so on).

joelberman, over 9 years ago

I think if you are a programmer or have some programming language experience, R is not very weird. But if you are a financial analyst or a social scientist, or a statistician, and only want to get your work done; it depends on your first programming language. If it was S3 you are golden. If it was Basic, you are not so golden. Mine was LISP.

Bluestrike2, over 9 years ago

I just introduced a friend of mine to R. He's working on his PhD in microbiology and was beside himself once he started working with R. Personally, I can't believe he hadn't used it before. It really is a beautiful language to work with once you get a handle on it.

stevehiehn, over 9 years ago

The only issue I have with R is that it is not great when exposed as a web service. For example, if you are using R you will need one container for "dployr" and another for your web service. It's not the end of the world, but more moving parts means more problems.

rcthompson, over 9 years ago

Maybe R should have optional "training wheels" that produce a warning every time an implicit conversion happens. In the OP's case, it would warn that a data frame was implicitly being converted to a matrix, and maybe also warn that the numeric vectors within were being converted to character in order to get slotted into that matrix.
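
A minimal sketch of the silent conversion described above, on a made-up two-column data frame (the article's `ice.cream` data is not reproduced here).

```r
# A data frame with one numeric and one character column (made-up data).
df <- data.frame(n = c(1, 2, 3), s = c("x", "y", "z"))

# apply() first coerces the data frame to a matrix; with mixed column types
# that matrix is character, so every column reports as non-numeric.
via_apply <- apply(df, 2, is.numeric)

# sapply() walks the columns as list elements, so no coercion happens.
via_sapply <- sapply(df, is.numeric)
```

`via_apply` is `FALSE` for both columns, while `via_sapply` is `TRUE` for `n` and `FALSE` for `s`. No warning is raised at any point, which is the gap the proposed "training wheels" would fill.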

haddr, over 9 years ago

R is a great language, but at the same time it can be a real pain.

Sometimes I imagine that some very wise guy designs a language much more concise and coherent, that could at the same time take advantage of the huge number of existing libraries written in R and C++... Maybe it's a dream, but so many times I wonder if that would even be possible.

numlocked, over 9 years ago

At my previous job we used to play "Guess what R does" over lunch. Someone would write a few R statements and we'd have to guess the output. Extremely difficult!

<pre><code>
>> a=c(1,2,3,4)
>> b=c(1,2)
>> a+b
</code></pre>

Any guesses?
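
For anyone who wants to check their guess, a short sketch of what happens: R recycles the shorter vector to the length of the longer one before adding. The names `total` and `explicit` are illustrative.

```r
a <- c(1, 2, 3, 4)
b <- c(1, 2)

# The shorter vector is recycled, so b is treated as c(1, 2, 1, 2).
total <- a + b

# Spelling the recycling out gives the same result.
explicit <- a + rep(b, length.out = length(a))
```

The result is 2, 4, 4, 6, with no warning, because 4 is a whole multiple of 2. R only warns when the longer length is not a multiple of the shorter one.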

makeset, over 9 years ago

Yes, I'm well aware of R's many faults, I have my own long list of R caveats I hand to new hires, but not bothering to learn the damn language is no reason to complain about it. First, RTFM.

gradstudent, over 9 years ago

ITT: people who don't understand R complain about R.

tempodox, over 9 years ago

Ah, the joys of a dynamic language and its implicit conversions.

misiti3780, over 9 years ago

Previous discussion: https://news.ycombinator.com/item?id=5450097