You know what will be popular? Whatever runs reasonably fast and helps you import and clean data quickly from a variety of sources.<p>Because the analysis is often the quickest part of being a data scientist. Coursera, as I recall, apparently cleans its data, and also lets you easily import it.<p>In real life, data is messy, and messed up. You looking at birthdays from some website? Expect a spike for whatever the default is... but that doesn't mean you can eliminate that data completely, because some people were presumably born Jan 1st.<p>You looking at birth years? I recall dealing with them in SAS... remember, if it's four digits, to check for births occurring in both the current and the past century.<p>And hey... do you have two or more elements of data for an individual? 2% to 5% will probably be missing some element, and some will have wrong data: a zip code off by one, an address not in the city you are looking to geocode for, whatever. If you are lucky, it will be obvious stuff like that.<p>The life of the data scientist is mostly cleaning, formatting, and transferring data, with the occasional sweeeet analysis. Of course your analysis will probably give you nothing useful, because despite several thousand usable records, it's not clear if any element has a significant effect on the dependent variable you are looking at. If you are smart, maybe you can finagle an analysis based on a non-parametric distribution or logistic regression.<p>Oh, and often the speed of your analysis running is inversely correlated with how easy it is to code and enter your data. There is a reason people use SAS, and it's not because of its amazing IDE.
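To make the cleaning checks above concrete, here is a minimal Python sketch using only the standard library. The function names, the `factor` and `max_age` thresholds, and the record layout are all assumptions made up for illustration, not anything from the article or from SAS. Note that the spike check only flags suspicious dates for review; it does not drop them, since some people really were born on the default date.

```python
from collections import Counter
from datetime import date

def flag_default_spike(birthdays, factor=5.0):
    """Return (month, day) pairs that occur more than `factor` times the median count.

    `factor` is an arbitrary illustrative threshold, not a recommended value.
    """
    counts = Counter((d.month, d.day) for d in birthdays)
    if not counts:
        return []
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]
    return sorted(md for md, n in counts.items() if n > factor * median)

def implausible_birth_year(year, today=None, max_age=120):
    """True if a four-digit birth year is in the future or implies age > max_age."""
    today = today or date.today()
    return not (today.year - max_age <= year <= today.year)

def missing_fields(record, required):
    """List the required keys that are absent, None, or empty in `record`."""
    return [key for key in required if record.get(key) in (None, "")]
```

A quick usage example with made-up records:

```python
birthdays = [date(1980, 1, 1)] * 10 + [date(1980, 3, 4), date(1981, 5, 6)]
print(flag_default_spike(birthdays))           # candidates to review, not to delete
print(missing_fields({"zip": ""}, ("zip", "dob")))
```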
I wouldn't be surprised if Octave (an open source alternative to Matlab) becomes very popular, because a lot of Coursera classes use it for homework assignments.<p>I thought that Octave was an ugly little language at first; now I really like it. It's a great tool for doing linear algebra, data visualization, machine learning, neural networks, etc.
Here is a nice 4-year-old, still-active discussion on the pros and cons of different data analysis technologies: <a href="http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/" rel="nofollow">http://brenocon.com/blog/2009/02/comparison-of-data-analysis...</a>
I think the article tagline would be better as "Domain Specific Languages for data analysis". Fortunately, the article does mention Python, which is critical because new people might not recognize just <i>how</i> prevalent Python is for solving data analysis problems after reading this. The great work of the SciPy community has enabled Python to be used for <i>all</i> of the things that Matlab, R, and Julia can do. In addition, Python can integrate easily with these languages, so if you are a data analyst you need to learn Python.
Personally I'm in love with R's data.frame. It allows very concise, robust and elegant manipulation and subsetting of a data set.<p>I wish every language had such a built-in object type; I definitely feel its loss when I manipulate data in other languages such as JavaScript or Mathematica.
I wonder if there is room for some smaller languages optimized specifically for data analysis. In particular, I wonder how a carefully designed non-Turing-complete language would fare.<p>That would be a really cool project to work on: design a minimal language for expressing most types of data analysis at a higher level. If the language is sufficiently small and simple, I could see some very powerful tooling being possible for it.<p>Perhaps it might make sense to go even more specific: have a small language designed not just for data analysis but for analysis in a very specific vertical (say finance or bioinformatics). It would be awesome to let people express their ideas in terms of the domain and not worry about low-level details like loops.
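As a rough illustration of the idea above, here is a toy Python sketch of a loop-free analysis language. Everything in it is hypothetical: the step names `where` and `summarize`, the tuple layout, and the finance-flavored field names are invented for this example. A program is just a list of steps, and the language offers no loop, jump, or recursion construct, so every program finishes after one pass per step over a finite input.

```python
from collections import defaultdict

# Hypothetical mini-language: a program is a list of (op, argument) steps.
# Users can only filter and summarize; they cannot write loops themselves.
AGGREGATES = {"sum": sum, "count": len, "mean": lambda xs: sum(xs) / len(xs)}
COMPARISONS = {
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "==": lambda a, b: a == b,
}

def run(program, rows):
    """Interpret `program` over `rows`, a list of dicts; returns a list of dicts."""
    data = list(rows)
    for op, arg in program:
        if op == "where":
            field, cmp, value = arg
            data = [r for r in data if COMPARISONS[cmp](r[field], value)]
        elif op == "summarize":
            key, field, agg = arg
            groups = defaultdict(list)
            for r in data:
                groups[r[key]].append(r[field])
            data = [{key: k, agg: AGGREGATES[agg](v)} for k, v in groups.items()]
        else:
            raise ValueError(f"unknown step: {op}")
    return data
```

For example, with made-up rows, counting positive returns per sector looks like this:

```python
rows = [
    {"sector": "tech", "ret": 0.1},
    {"sector": "tech", "ret": 0.3},
    {"sector": "energy", "ret": -0.2},
]
program = [("where", ("ret", ">", 0.0)), ("summarize", ("sector", "ret", "count"))]
print(run(program, rows))
```

Because programs are plain data, tooling such as validators or visual editors could inspect them without parsing arbitrary code, which is one reason a small language like this might be attractive.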
When introducing python, the author writes "Despite the obvious advantages of MATLAB, R, and Julia, it’s also always worth considering what a general-purpose language can bring to the table."<p>Even with thousands of hours of experience in Matlab, R and Python... I'm not sure what "obvious advantage" Matlab and R share over Python.
python fanboy here:
"[python is] not as tuned to numerics as MATLAB": if you build numpy with ATLAS there is, in my experience, hardly ever any noticeable speed difference between numpy and MATLAB
I have to create figures with Matlab and that's a pain in the ass. Changing XTickLabels kills another part of the figure, and in general it's very hard to do anything a little more involved with figures.<p>But the basic data analysis is fine. The IDE has awful code completion, and the editor lacks refinement.
One of my bioinformatics courses "required" MATLAB because the class project was based on a simulation framework called the COBRA Toolbox which was developed in MATLAB[1]. I didn't know who to ask about obtaining a MATLAB license, so instead I just got it to work in Octave and used that. I was pleasantly surprised at how little I had to tweak before the framework just worked in Octave, given that as far as I know everyone in the lab that develops the framework just uses MATLAB.<p>[1] <a href="http://opencobra.sourceforge.net/openCOBRA/Welcome.html" rel="nofollow">http://opencobra.sourceforge.net/openCOBRA/Welcome.html</a>
Perl was mentioned in the article, but PDL (Perl Data Language) wasn't.<p><a href="https://metacpan.org/module/PDL" rel="nofollow">https://metacpan.org/module/PDL</a><p><i>PDL is the Perl Data Language, a perl extension that [...] includes fully vectorized, multidimensional array handling, plus several paths for device-independent graphics output.</i><p><i>PDL is fast, comparable and often outperforming IDL and MATLAB in real world applications. PDL allows large N-dimensional data sets such as large images, spectra, etc to be stored efficiently and manipulated quickly.</i><p>For integration with R, there are Statistics::R (<a href="https://metacpan.org/module/Statistics::R" rel="nofollow">https://metacpan.org/module/Statistics::R</a>) and Statistics::useR (<a href="https://metacpan.org/module/Statistics::useR" rel="nofollow">https://metacpan.org/module/Statistics::useR</a>)