I have started taking some basic data science courses on edx.org. Then I came across Hadoop, and I would really love to learn it. I have the following questions and would really appreciate your help:<p>1. What is the best source to start learning Hadoop? I was thinking of starting with Udacity or Big Data University.<p>2. Do I need Linux to run Hadoop? I am having wifi issues even after I did the driver upgrade.<p>3. In order to be employed, do I need to learn the entire system or just one portion of it, like Spark, Hive, or Pig?<p>Please advise.
Caveat: This is random advice from the internet.<p>1. If it were me, I'd start by installing Hadoop on a laptop, since Googling indicates it's doable... for some definition of 'doable.' Even if I could not get it to work, reading the documentation and researching whatever problems I encountered would deepen my practical knowledge. Getting Hadoop up and running is also one facet of a practical working definition of 'knowing Hadoop.'<p>2. Linux wireless driver blobs have been a source of pain for me. The workarounds for me have been:<p>a. Purchase well-supported hardware, e.g. a used ThinkPad and cards without obscure Broadcom chips.<p>b. Use an external wireless router and an Ethernet cable. That's how I connect desktops and laptops around the office.<p>3. My gut is that the important knowledge for many positions requiring or preferring Hadoop will be more related to data science than to technical expertise. On the other hand, looping back to my earlier advice, positions that are Hadoop-first rather than data-science-first will benefit from an operational understanding.<p>Lastly, what I've been hearing about the industry is that 'embarrassingly parallel' workloads that can take full advantage of Hadoop are not as common as was thought a few years ago. The big useful innovation of Hadoop is looking like the underlying Hadoop Distributed File System (HDFS), and other big data search/query tools are being built on top of it.<p>That's not to say Hadoop is dead or not worth exploring, particularly at the technical level of HDFS and in terms of applying data science concepts. Learning Pig or Hive makes sense in service of learning how to apply those concepts. Because Hive's query language is based on SQL, it is probably the more generalizable skill... and learning SQL is probably more useful than learning either in terms of employment.<p>Good luck.
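To make the 'embarrassingly parallel' point above a bit more concrete, here is a minimal sketch of a word count in the map/reduce style that Hadoop popularized. It is plain Python, not Hadoop code: the function names (`map_chunk`, `merge_counts`, `word_count`) and the sample input are made up for illustration, and the map calls run sequentially here rather than across a cluster.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk, independent of all other chunks."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(chunks):
    # Each map_chunk call needs no data from any other call, which is what
    # makes this workload embarrassingly parallel. They run one after another
    # here; a framework such as Hadoop would spread them across machines.
    partial_counts = map(map_chunk, chunks)
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    # Hypothetical input, split into chunks that stand in for file blocks.
    chunks = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(chunks).most_common(2))
```

A job only fits this shape when the per-chunk work really is independent; once chunks need to see each other's data, the simple map-then-merge structure stops being enough.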
I learned Hadoop in grad school in 2013. If you can spend a little bit of cash, get some VMs on AWS and follow one of the many guides out there (for example, Cloudera's) to install Hadoop. That should be enough to build something like: <a href="http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/" rel="nofollow">http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data...</a>.<p>I started out trying VMs on VirtualBox, then a couple of different laptops at home, etc., but AWS was the easiest setup in the end.
There's also <a href="http://www.cloudera.com/training.html" rel="nofollow">http://www.cloudera.com/training.html</a><p>You can run Linux in a virtual machine (VirtualBox, VMware, etc.), where you wouldn't have to deal with wifi drivers because the VM uses the host operating system's existing network connection.
There are a couple of pointers to books in <a href="https://news.ycombinator.com/item?id=12389595" rel="nofollow">https://news.ycombinator.com/item?id=12389595</a>