What's wrong with statistics in Julia?

132 points by benhamner over 10 years ago

8 comments

vegabook over 10 years ago

Is Julia seriously targeting the R user base? Or, if it is honest with itself, is it going after the Matlab people first? My sense is that engineers (Matlab users) and, to a certain extent, scientists (where Python rules) will be drawn in, but the stats crowd has subtly different priorities, which seem to be alluded to here.

Graphics are one such priority. The excellent ggplot2 gets all the glory, but base graphics are mega-fast, extremely robust, and deeply customizable, and there is the unsung hero of grid graphics, which provides an extraordinary set of abstractions for building things like ggplot2 and/or Lattice. My point is that this much graphical quality speaks directly to one of the key requirements of statisticians: at the end of the analysis, communication is usually required. This is much less the case for engineers or financiers (big Matlab users), for example, where correct and fast answers are the endpoint. Where is Julia on graphics? Last time I checked, it was still trying to interface to R and/or Matplotlib.

The other thing that intrigues me is Julia's scalar computations being "at least" as important as vectors. This has the whiff of for loops, an immediate eye-roller for R people, who are accustomed to vectorization everywhere and, essentially, exclusively. I am not suggesting that Julia doesn't do vectors well, just that, like any set of priorities, it is not catering first to statisticians, whose requirements are often quite different from those of the scientists and engineers who use Matlab and Python.

nkurz over 10 years ago

I found some development discussion about the changes to missing values here: https://github.com/JuliaLang/julia/issues/8423

I've yet to try out Julia, but seeing productive and intelligent discussion like this followed quickly by execution certainly inspires confidence that it's a language with considerable promise.

username223 over 10 years ago

> In a future blog post, I’ll describe how Nullable works in greater detail.

This is the part that interests me. If you're not using sentinel values like NaN, then it seems like you're left with pointers (terrible) or tags and tag tests (also pretty bad). If Julia can't use the processor's SIMD instructions (or the GPU in the near future), it's not suitable for inner loops. Do you special-case "Nullable{Double}" to use NaN as its "NA" value?
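To make the two options in this question concrete, here is a minimal Python sketch contrasting a NaN sentinel with a separate tag (mask) array. It is only an illustration of the trade-off; the names `values`, `is_missing`, `mean_sentinel`, and `mean_masked` are made up for this example, and nothing here describes how Julia's Nullable is actually implemented.

```python
import math

# Hypothetical float column with one missing entry at index 2.
values = [1.5, 2.0, 0.0, 4.5]
is_missing = [False, False, True, False]  # tag array: True marks "NA"

# Sentinel encoding: the missing slot holds NaN inside the data itself.
sentinel = [math.nan if m else v for v, m in zip(values, is_missing)]


def mean_sentinel(xs):
    """Mean that skips NaN sentinels; the missingness test is on the value."""
    kept = [x for x in xs if not math.isnan(x)]
    return sum(kept) / len(kept)


def mean_masked(xs, mask):
    """Mean that skips entries whose tag says "missing"."""
    kept = [x for x, m in zip(xs, mask) if not m]
    return sum(kept) / len(kept)


sentinel_mean = mean_sentinel(sentinel)
masked_mean = mean_masked(values, is_missing)
```

Both encodings give the same mean here. The difference is that the sentinel version cannot tell a missing entry apart from a NaN produced by a real computation, while the masked version pays for a second array and a tag test per element.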

peatmoss over 10 years ago

I've got a perhaps naïve question related to:

> In particular, I’d like to replace DataFrames with a new type that no longer occupies a strange intermediate position between matrices and relational tables.

Namely, why is an embedded SQLite database not used for all things tabular in languages like R/Julia/Foo? I was thinking about this as I was attempting to reconstruct a visualization in Racket using its (pretty good!) 2D plotting facilities and lamenting not having a tabular data structure.

SQLite is embeddable. It has fast in-memory database support. It can operate reasonably quickly on data that is stored on disk. It supports indexing. NULL values already exist to represent missingness. SQLite allows for callbacks to user-supplied functions, which I'd imagine could be created relatively easily in something like Julia.

As a side benefit, it seems like a SQLite-oriented tabular data store could be extended, as dplyr has done, to support other databases.

When I think about the use cases where I've found myself reaching for DataFrame or data.frame, I struggle to think of one where a tightly integrated SQLite wouldn't work.

Are there Computer Science reasons why this is a silly idea? I know Pandas claims nominally better performance than SQLite in some circumstances, but then again SQLite has also recently seen some substantial performance gains.
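The SQLite features listed above can be sketched with Python's standard `sqlite3` module. This is only an illustration of the idea, not a proposal for how R or Julia would integrate it; the table `obs`, its columns, and the callback `log1p_or_null` are hypothetical names invented for the example.

```python
import math
import sqlite3

# In-memory database acting as the tabular store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (grp TEXT, x REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    # None is stored as NULL, which stands in for a missing value.
    [("a", 1.0), ("a", None), ("a", 3.0), ("b", 10.0), ("b", 20.0)],
)
conn.execute("CREATE INDEX idx_obs_grp ON obs (grp)")


def log1p_or_null(x):
    """User-supplied callback: propagate NULL, otherwise apply log1p."""
    return None if x is None else math.log1p(x)


# Register the Python function so SQL queries can call it.
conn.create_function("log1p_or_null", 1, log1p_or_null)

# AVG and COUNT(x) skip NULLs; COUNT(*) counts every row.
group_means = conn.execute(
    "SELECT grp, AVG(x), COUNT(x), COUNT(*) FROM obs GROUP BY grp ORDER BY grp"
).fetchall()

transformed = conn.execute(
    "SELECT log1p_or_null(x) FROM obs WHERE grp = 'a' ORDER BY rowid"
).fetchall()
```

Here `group_means` reports a mean of 2.0 for group `a` from its two non-missing rows out of three, and `transformed` keeps the NULL in place rather than failing on it.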

howeman over 10 years ago

I'd be interested in seeing things you think are done correctly as well. We're also working on a statistics package in Go (github.com/gonum/stat). It's good that people are taking a fresh look. Our capabilities are clearly limited at the moment, and the features of Go are quite different from those of Julia, but there's still a lot that can be learned in common.

nyir over 10 years ago

Isn't R's rather large flexibility to not eagerly evaluate, or to rewrite function arguments entirely (and apparently to manipulate scope, which is a new one for me), one of the major points of criticism? It always seemed to me that having "proper" macros is a selling point rather than only an approximation/emulation.

lottin over 10 years ago

One of Julia's main selling points has been speed. They said there was no reason a dynamic language had to be slow, but now we see that as soon as they start adding basic features, such as support for missing values, it's beginning to take a toll on performance. I wonder if Julia 1.0 will be any faster than R or Python.

gajomi over 10 years ago

I am a big fan of much of the work going on with statistics in Julia.

I'd like to point out another (related) thing, though, that is wrong with the state of statistics in Julia, in my opinion. Actually, my problem has less to do with statistics per se than with the mathematical foundations of statistical calculations. There are, of course, many different approaches to statistical inference (the Bayesian vs. frequentist camps being infamous among them), but the calculations all come down to reasoning about probabilities, which is a well-posed but not always easy task. The Julia developers, in their wisdom, have recognized this, and as such have put together Distributions.jl, the purpose of which is to provide datatypes and utility functions over probability distributions (plus some vestigial methods for maximum likelihood inference, which thankfully stay out of the way). If you haven't seen it, check it out. It's got a nice design, I think.

But there is presently a clear and severe limitation: the parameterized type hierarchy. The requirement is that every distribution have support over a set in which all members are either Univariate, Multivariate or Matrixvariate, with elements in the fields being either Discrete or Continuous. This obviously misses the general picture of the kinds of sets from which one draws random variables, which plays into issues that the article mentions (if the probability distribution datatype can't model the data, you have to do ad hoc things to account for it).

Indeed, a huge chunk of the issues currently open on the Distributions GitHub page essentially boil down to problems with representing the sets and/or spaces from which elements in the distribution are drawn:

https://github.com/JuliaStats/Distributions.jl/issues/147
https://github.com/JuliaStats/Distributions.jl/issues/309
https://github.com/JuliaStats/Distributions.jl/issues/224
https://github.com/JuliaStats/Distributions.jl/issues/283

End users have a variety of well-developed ideas in mind about the sets that their samples belong to, and even the spaces from which they are drawn, which currently exceed the representational capacity of the existing types. In my opinion, the way to fix this is to start a separate library that focuses on types and methods for representing and manipulating sets and spaces (i.e. topological information attached to sets). This could then be consumed by the Distributions people as well as by others modeling things outside of probability.
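As a rough illustration of what such a separate sets library might offer, here is a small Python sketch in which a distribution carries an explicit support object instead of a fixed Univariate/Multivariate and Discrete/Continuous tag. Every name in it (`Interval`, `FiniteSet`, `Product`, `UniformOn`) is hypothetical and invented for this example; none of it is the Distributions.jl API or an existing package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] of real numbers (a continuous set)."""
    lo: float
    hi: float

    def contains(self, x) -> bool:
        return isinstance(x, (int, float)) and self.lo <= x <= self.hi


@dataclass(frozen=True)
class FiniteSet:
    """Explicit finite collection of hashable members (a discrete set)."""
    members: frozenset

    def contains(self, x) -> bool:
        return x in self.members


@dataclass(frozen=True)
class Product:
    """Cartesian product of sets; factors may mix discrete and continuous."""
    factors: tuple

    def contains(self, x) -> bool:
        return (
            isinstance(x, tuple)
            and len(x) == len(self.factors)
            and all(f.contains(xi) for f, xi in zip(self.factors, x))
        )


@dataclass(frozen=True)
class UniformOn:
    """Uniform density whose support is described by an Interval object."""
    support: Interval

    def pdf(self, x) -> float:
        if not self.support.contains(x):
            return 0.0
        return 1.0 / (self.support.hi - self.support.lo)


# A mixed support: a real number in [0, 1] paired with a colour label.
mixed = Product((Interval(0.0, 1.0), FiniteSet(frozenset({"red", "blue"}))))
unit_uniform = UniformOn(Interval(0.0, 2.0))
```

With this shape, `mixed.contains((0.5, "red"))` is a membership question answered by the set itself, so a sample space that is partly discrete and partly continuous does not need a new branch in a distribution type hierarchy.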