Matlab, R, and Julia: Languages for data analysis

123 points by bugsbunnyak over 12 years ago

16 comments

lsiebert over 12 years ago

You know what will be popular? Whatever runs reasonably fast and helps you import and clean data quickly from a variety of sources. Because the analysis is often the quickest part of being a data scientist. Coursera, as I recall, apparently cleans its data, and also lets you easily import it.

In real life, data is messy, and messed up. You looking at birthdays from some website? Expect a spike for whatever the default is... but that doesn't mean you can eliminate that data completely, because some people were presumably born Jan 1st.

You looking at birth years? I recall dealing with them in SAS... remember, if it's four digit, that you check for births occurring in the current and past century.

And hey... do you have two or more elements of data for an individual? 2% to 5% will probably be missing some element, and some will have wrong data: a zip code off by one, an address not in the city you are looking to geocode for, whatever. If you are lucky, it will be obvious stuff like that.

The life of the data scientist is mostly cleaning, formatting, and transferring data, with the occasional sweeeet analysis. Of course your analysis will probably give you nothing useful, because despite several thousand usable records, it's not clear if any element has a significant effect on the dependent variable you are looking at. If you are smart, maybe you can finagle an analysis based on a nonparametric distribution or logistic regression.

Oh, and often the speed of your analysis running is inversely correlated with how easy it is to code and enter your data. There is a reason people use SAS, and it's not because of its amazing IDE.
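To make the birthday and birth-year checks above concrete, here is a minimal Python sketch. The function names, the 3x spike threshold, and the 120-year age cap are all assumptions chosen for illustration, not anything from the comment or from SAS. The point is that suspicious dates get flagged for review rather than dropped, since some people really were born on Jan 1st.

```python
from collections import Counter
from datetime import date


def flag_default_birthdays(birthdays, threshold=3.0):
    """Return the (month, day) pairs that occur far more often than average.

    `birthdays` is a list of datetime.date objects. `threshold` is an
    illustrative cutoff: a day is flagged if its count exceeds `threshold`
    times the mean count per observed day.
    """
    counts = Counter((d.month, d.day) for d in birthdays)
    if not counts:
        return set()
    mean_per_day = len(birthdays) / len(counts)
    return {md for md, n in counts.items() if n > threshold * mean_per_day}


def plausible_birth_year(year, current_year, max_age=120):
    """True if a four-digit year falls in a believable range (assumed cap: 120 years)."""
    return current_year - max_age <= year <= current_year


# Hypothetical sample: ten records on the form default, five scattered dates.
sample = [date(1980, 1, 1)] * 10 + [
    date(1975, 3, 14), date(1982, 7, 4), date(1990, 11, 30),
    date(1968, 5, 9), date(1979, 9, 21),
]
suspicious = flag_default_birthdays(sample)
# Flag the affected records for review; do not delete them.
flagged = [d for d in sample if (d.month, d.day) in suspicious]
```

On real data the threshold would need tuning against the size of the data set, and a flagged record still needs a second signal (for example a missing birth year) before it is treated as a default value.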

mark_l_watson over 12 years ago

I wouldn't be surprised if Octave (an open source alternative to Matlab) becomes very popular, because a lot of Coursera classes use it for homework assignments.

I thought that Octave was an ugly little language at first, but now I really like it: a great tool for doing linear algebra, data visualization, machine learning, neural networks, etc.

pav3l over 12 years ago

Here is a nice four-year-old, still-active discussion of the pros and cons of different data analysis technologies: http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/

travisoliphant over 12 years ago

I think the article tagline would be better as "Domain Specific Languages for data analysis". Fortunately, the article does mention Python, which is critical, because after reading it new people might not otherwise recognize just how prevalent Python is for solving data analysis problems. The great work of the SciPy community has enabled Python to be used for all of the things that Matlab, R, and Julia can do. In addition, Python can integrate easily with these languages, so if you are a data analyst you need to learn Python.

scottfr over 12 years ago

Personally I'm in love with R's data.frame. It allows very concise, robust and elegant manipulation and subsetting of a data set.

I wish every language had such a built-in object type. I definitely feel its loss when I manipulate data in other languages such as JavaScript or Mathematica.
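As a rough illustration of what is missing without a built-in data frame, here is a sketch in plain Python of the filter-rows-and-pick-columns operation that a data frame gives you in one expression. The `subset` helper and the sample records are hypothetical, written only for this example.

```python
def subset(rows, predicate, columns):
    """Keep the rows where predicate(row) is true, projected onto `columns`.

    `rows` is a list of dicts standing in for a table; this helper is a
    hand-rolled stand-in, not part of any library.
    """
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


# Hypothetical records for illustration.
people = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Bob", "age": 25, "city": "Paris"},
    {"name": "Cy", "age": 41, "city": "London"},
]

# Rows with age over 30, keeping only the name and age columns.
older = subset(people, lambda r: r["age"] > 30, ["name", "age"])
```

Every project ends up writing some variant of this helper by hand, along with grouping, joining, and missing-value handling, which is the loss the comment describes.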

tikhonj over 12 years ago

I wonder if there is room for some smaller languages optimized specifically for data analysis. In particular, I wonder how a carefully designed non-Turing-complete language would fare.

That would be a really cool project to work on: design a minimal language for expressing most types of data analysis at a higher level. If the language is sufficiently small and simple, I could see some very powerful tooling being possible for it.

Perhaps it might make sense to go even more specific: have a small language designed not just for data analysis but for analysis in a very specific vertical (say finance or bioinformatics). It would be awesome to let people express their ideas in terms of the domain and not worry about low-level details like loops.

dbecker over 12 years ago

When introducing Python, the author writes: "Despite the obvious advantages of MATLAB, R, and Julia, it’s also always worth considering what a general-purpose language can bring to the table."

Even with thousands of hours of experience in Matlab, R and Python... I'm not sure what "obvious advantages" Matlab and R share over Python.

lorenzfx over 12 years ago

python fanboy here: "[python is] not as tuned to numerics as MATLAB": if you build numpy with ATLAS there is, in my experience, hardly ever any noticeable speed difference between numpy and MATLAB

myspy over 12 years ago

I have to create figures with Matlab and that's a pain in the ass. Changing XTickLabels kills another part of the figure, and in general it's very hard to do anything beyond the basics with figures.

But the basic data analysis is fine. The IDE has awful code completion and the editor lacks refinement.

StefanKarpinski over 12 years ago

This is a really excellent and well-balanced article. Very much captures the pluses and minuses of these various systems for data analysis.

rcthompson over 12 years ago

One of my bioinformatics courses "required" MATLAB because the class project was based on a simulation framework called the COBRA Toolbox, which was developed in MATLAB [1]. I didn't know who to ask about obtaining a MATLAB license, so instead I just got it to work in Octave and used that. I was pleasantly surprised at how little I had to tweak before the framework just worked in Octave, given that as far as I know everyone in the lab that develops the framework just uses MATLAB.

[1] http://opencobra.sourceforge.net/openCOBRA/Welcome.html

prakashk over 12 years ago

Perl was mentioned in the article, but PDL (Perl Data Language) wasn't.

https://metacpan.org/module/PDL

PDL is the Perl Data Language, a perl extension that [...] includes fully vectorized, multidimensional array handling, plus several paths for device-independent graphics output.

PDL is fast, comparable and often outperforming IDL and MATLAB in real world applications. PDL allows large N-dimensional data sets such as large images, spectra, etc to be stored efficiently and manipulated quickly.

For integration with R, there are Statistics::R (https://metacpan.org/module/Statistics::R) and Statistics::useR (https://metacpan.org/module/Statistics::useR).

elchief over 12 years ago

Everyone loves to shit all over Java, but Mahout, RapidMiner, Weka, Hive, HBase are all written in it.

tvladeck over 12 years ago

Thought I'd ask since I'm learning Clojure - are there experiences worth sharing re: using Incanter in these types of settings?

agentq over 12 years ago

no love for J?

zem over 12 years ago

surprising omission at the end - any mention of scipy should at least include a pointer to sage as well.