My SO is currently a statistician/data researcher; she primarily works in Excel/VBA/SQL.<p>As she works in the public sector, we would like to broaden her employment prospects, and looking at most JDs for Data Scientists or similar stats-based roles, programming is a must.<p>So which is the most suitable language to learn? Python? R?<p>TL;DR: Data Scientists, what did you start out learning? What worked? What would you do differently?
As others said here, Python seems like a good choice; plus it's a "real programming language". I mean, you can use it for more stuff: web programming, web scraping, etc. There are a lot of libraries, even for game development.<p>If you can learn more than one, I would recommend learning R as well, rather than other technologies such as Octave or MATLAB.
I'd go with Python and specifically the scientific + numerical libraries. There are books like "Python for Data Analysis" from O'Reilly.<p>Five years ago I would have said "R", but Python enthusiasts have been replicating what R does in Python at a frantic pace, and R will never really replicate what Python brings to the table.
As a statistician I spend ca. 80% of my time collecting and transforming data, maybe because I never had the luxury of having one or two data engineers of my own doing that for me. Having worked with SAS, SPSS, Matlab and Python, and having tried some other tools, I would say that the choice of statistical programming language does not make a big difference. Once you understand a modelling process, you can reproduce/use it in any language as long as there is a documented package for it.<p>On the other hand, knowing how to work with data is IMHO more important. Being an SQL pro, knowing how to think in data sets instead of data records, when to use flat tables, and how to use vectorization and matrix manipulation even for everyday tasks, especially in "in-memory" systems, is essential.<p>I would say SQL + R/Python makes a good combination. With that you can solve a lot of problems in at least two different ways. R is being integrated step by step into DWHs, which makes things a lot easier. I hope SAS dies a short and painful death, but it could also be a valid choice.
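To illustrate the "data sets instead of data records" point above, here is a minimal sketch using only Python's standard library. The `sales` table, its columns, and the sample rows are invented for illustration; they are not from the thread. The first function walks the rows one at a time, while the second describes the result and lets the SQL engine do the aggregation.

```python
import sqlite3

# Hypothetical sample data: (region, amount) rows invented for illustration.
ROWS = [
    ("north", 10.0),
    ("south", 5.0),
    ("north", 2.5),
    ("south", 7.5),
    ("east", 4.0),
]


def totals_record_by_record(rows):
    """Record-oriented: visit each row and update a running total by hand."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_set_based(rows):
    """Set-oriented: state the desired result and let SQL aggregate it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(totals_record_by_record(ROWS))
    print(totals_set_based(ROWS))
```

Both functions return the same totals per region. The set-based version is roughly the habit that carries over to vectorized operations in R or Python: describe the whole transformation rather than looping over individual records.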
I'd note that SAS is still used in the public sector and enterprise to a larger degree than you'd think, just because of legacy usage. For example, the FDA had to clarify a few years ago that R was okay, because their regulations required the use of SAS5 formatted data for electronic submissions, and when I worked in a county health department, we used SAS.<p>Perl was also the bioinformatics golden child for a while, and I expect there are still people using it for that purpose in industry.<p>That said, looking beyond the public sector, I'd look at Python as broadening her prospects more than anything else, just because it's more broadly used in a variety of industries, and a general understanding of it is more broadly applicable.<p>I'd suggest she also learn enough R to import and export data. If there are R scripts she needs to use, Python can be used to script R from the command line, so she can import data, process it in R as an intermediate step, and export the results back into Python.
From a statistics perspective R is the language to learn.<p>Python is good for data engineering or pipelining, etc - but R is the best for analysis:<p>- RStudio is a much friendlier interface than IPython/Jupyter notebooks<p>- Python's visualization libraries can't come close to ggplot2<p>- Python lacks an effective grammar of data manipulation similar to dplyr or magrittr.<p>I think HN is more engineering focused, hence the increased exposure to Python. Of the places I've worked or interviewed at for data science, one was fully Python (though they have a high engineering bar for data scientists, and very few data engineers), and the rest had a reasonable split of R and Python. Your SO will be fine either way but might find R more intuitive and better suited to statistics work.
Dedicated probabilistic modeling environments are also gaining ground and may become standard in the near future.<p>The holy grail is something like: feed in some data or parameters, and have an algorithm generate the corresponding correct Bayesian inference and posterior distribution. It's very easy for scientists, even with years of knowledge and experience, to implement things incorrectly ;)<p>Check out Stan and Figaro:<p><a href="http://mc-stan.org/" rel="nofollow">http://mc-stan.org/</a><p><a href="https://www.cra.com/technical-expertise/probabilistic-modelingprogramming" rel="nofollow">https://www.cra.com/technical-expertise/probabilistic-modeli...</a>
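As a rough idea of what "feed in some data and get a posterior" means, here is a hand-worked sketch of the simplest textbook case: a Beta prior on a success probability, updated with binomial data. This is not how Stan or Figaro work internally (those tools handle general models with sampling or other inference algorithms); the function names and the example numbers are assumptions made up for illustration.

```python
def beta_binomial_posterior(prior_a, prior_b, successes, trials):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior.

    Returns the posterior parameters (a + successes, b + failures).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    failures = trials - successes
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


if __name__ == "__main__":
    # Hypothetical data: 7 successes in 10 trials, uniform Beta(1, 1) prior.
    post_a, post_b = beta_binomial_posterior(1, 1, successes=7, trials=10)
    print(post_a, post_b, beta_mean(post_a, post_b))
```

The appeal of a dedicated environment is that the user only writes down the model and the data, and this kind of derivation (which is only available in closed form for special cases like this one) does not have to be done, or gotten wrong, by hand.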
Python or R, or a combination of both:
<a href="https://en.wikipedia.org/wiki/R_(programming_language)" rel="nofollow">https://en.wikipedia.org/wiki/R_(programming_language)</a>
R is what companies that are leaving SPSS and SAS are switching to. While I also like Python for data science, R seems to be more popular in the industry.
I'd also recommend having a look at F#. It's a functional language, so it might be more intuitive for a mathematician to learn. It can also call R functions, so you can use all the R statistical functions and packages.