I would recommend first of all that you get familiar with the names and versioning. Hadoop is a mess, and it's basically a family of big data projects.<p>You have two/three main components in Hadoop:<p>- Data nodes that constitute HDFS. HDFS is Hadoop's distributed file system, which is basically a replicated fs that stores a bunch of bytes. You can have really dumb data (let's say a bunch of bytes), compressed data (which saves space, but depending on the codec you may need to uncompress the whole file just to read a segment), data arranged in columns, etc. HDFS is agnostic of all this. This is where you hear names like gzip, Snappy, LZO, Parquet, ORC, etc. (There's a small Java sketch of reading and writing HDFS at the end of this comment.)<p>- Compute nodes which run tasks, jobs, etc. depending on the framework. Normally you submit a job, which is composed of tasks that run on compute nodes and pull data from HDFS nodes. A compute node can also be an HDFS node. There are a lot of frameworks on top of Hadoop; what is important is that you know the stack (ex: <a href="https://zekeriyabesiroglu.files.wordpress.com/2015/04/ekran-resmi-2015-04-23-08-34-00.png" rel="nofollow">https://zekeriyabesiroglu.files.wordpress.com/2015/04/ekran-...</a>). So you have HDFS, and on top of that you (now) have YARN, which handles resource negotiation within the cluster.<p>- Scheduler/job runner. This is kinda what YARN does (please someone correct me). Actually it's a little more complicated: <a href="https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html" rel="nofollow">https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yar...</a><p>Since Hadoop jobs are normally a JAR, there are several ways of creating a JAR ready to be submitted to a Hadoop cluster:<p>- Coding it in Java (nobody does it anymore, but it's the baseline; there's a word-count sketch at the end of this comment to show what it looks like)
- Writing it in a quirky dataflow language called Pig (strictly speaking, the language is Pig Latin)
- Writing it in an SQL-like language called HiveQL (you first need to create "tables" that map to files on HDFS)
- Writing it with a generic Java framework called Cascading
- Writing jobs in Scala with Scalding, a framework on top of Cascading
- Writing them in Clojure with a DSL that maps to either Pig or Cascading (Netflix's PigPen)
- ...<p>As you can imagine, since HDFS is just an fs, other frameworks have appeared that do distributed processing and can connect to HDFS in some way (there's a Spark sketch at the end of this comment):
- Apache Spark
- Facebook's Presto
- ...<p>And since there are so many moving parts, there are a lot of components for putting and getting data on HDFS, nicer job schedulers, etc. This is all part of the Hadoop ecosystem: <a href="http://3.bp.blogspot.com/-3A_goHpmt1E/VGdwuFh0XwI/AAAAAAAAE2w/CKt2D2xmRkw/s1600/EcoSys_yarn.PNG" rel="nofollow">http://3.bp.blogspot.com/-3A_goHpmt1E/VGdwuFh0XwI/AAAAAAAAE2...</a><p>Back to your question: I suggest you spin up your own cluster (this guide was the best I found: <a href="https://blog.insightdatascience.com/spinning-up-a-free-hadoop-cluster-step-by-step-c406d56bae42#.z6g3j1trf" rel="nofollow">https://blog.insightdatascience.com/spinning-up-a-free-hadoo...</a>) and run some examples. There are a lot of details about Hadoop, such as how to store the data and how to schedule and run jobs, but most of the time you are just connecting new components and fine-tuning jobs to run as fast as possible.<p>Make sure you don't get scared by lots of project names!
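<p>To make the "HDFS is basically just a replicated fs" point concrete, here is a minimal sketch of reading and writing HDFS from plain Java. The namenode address and the path are made-up placeholders, swap in your own:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally this comes from core-site.xml on the cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // It really is just a filesystem: create a file, write some bytes, check it's there.
        Path p = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(p, true)) {
          out.writeBytes("hello hdfs\n");
        }
        System.out.println("exists: " + fs.exists(p) + ", size: " + fs.getFileStatus(p).getLen());
        fs.close();
      }
    }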
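<p>This is roughly what the "coding it in Java" option looks like: the classic word-count MapReduce job, which you package as a JAR and submit to the cluster. Nothing here is specific to any setup except the assumption that the input and output HDFS paths are passed as arguments:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (word, 1) for every token
          }
        }
      }

      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          context.write(key, new IntWritable(sum));  // emit (word, total count)
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

You would then submit it with something like "hadoop jar wordcount.jar WordCount /some/input /some/output", and YARN takes care of scheduling the map and reduce tasks on the compute nodes.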
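<p>And for contrast, the same word count sketched with Spark's Java API, reading straight off HDFS without going through MapReduce at all. The hdfs:// URLs are placeholders, and you would normally launch this with spark-submit:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("spark word count").getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Spark only uses HDFS as storage; the processing engine is its own.
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/some/input");
        lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
             .mapToPair(word -> new Tuple2<>(word, 1))
             .reduceByKey(Integer::sum)
             .saveAsTextFile("hdfs://namenode:8020/some/output");

        spark.stop();
      }
    }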