
Ask HN: Best way to start and master Hadoop?

49 points by hubatrix over 8 years ago
I am starting to learn Hadoop. What are some good tutorials that can help me with it?

13 comments

burgerdev over 8 years ago
Reconsider whether you really need to: https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

(although it might look good on your CV)
heartsucker over 8 years ago
I started with Hive and worked backwards. It gives you a nice SQL interface and allows you to do M/R operations on CSV files easily. Once you get the hang of it, going back towards raw M/R or even something like Cascading/Scalding might be less of a shock.

If you know Cassandra or another NoSQL store, you can try your hand at HBase. To do anything with it beyond adding or removing data from a key, you'll need to write an application of some sort. Cataloging tweets is a decently simple exercise.

In my work, the only time I accessed HDFS directly was doing a put/delete of a flat CSV file that I was going to load into Hive. I'm not saying there are no use cases for using HDFS directly, just that in the setups I've used, I've never seen one.
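To make that concrete, here is a minimal sketch of the Hive-over-CSV workflow (the table name, column layout, and HDFS path are hypothetical), using the pyhive client against a HiveServer2 instance assumed to be running on localhost:10000:

    # Minimal sketch: overlay a Hive table on a CSV already sitting in
    # HDFS, then query it with plain SQL. Paths and schema are hypothetical.
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, username="dev")
    cur = conn.cursor()

    # An EXTERNAL table just maps a schema onto the files; no data moves.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS tweets (
            user_id STRING,
            created_at STRING,
            body STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/tweets/'
    """)

    # Hive compiles this into M/R jobs behind the SQL interface.
    cur.execute("SELECT user_id, COUNT(*) FROM tweets GROUP BY user_id")
    for row in cur.fetchall():
        print(row)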
lifebeyondfife over 8 years ago
First I'd ask: what's the skill you want to gain?

Do you want to learn how to set up Hadoop clusters, Zookeeper, etc. on bare metal and understand all the maintenance of such a system? Or do you want to learn how to use the tool for enabling data science projects?

If it's the former, companies like Databricks are becoming popular because they abstract a lot of that complexity away: https://databricks.com/try-databricks

Because you're coming at it from scratch, I'd strongly advise you to look at the future trend and start straight with Spark. It is also made by Apache and is the next-generation solution. https://spark.apache.org/

To give an idea of the difference, Hadoop uses Map (transform) and Reduce (action) higher-order functions to solve distributed data queries and aggregations. Spark can do the same, but it has access to many additional higher-order functions as well. This makes the problem solving much more expressive. See the lists of transformations (http://spark.apache.org/docs/latest/programming-guide.html#transformations) and actions (http://spark.apache.org/docs/latest/programming-guide.html#actions).

The Spark documentation and interpreter are good places to start.

https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples (Scala) and https://github.com/apache/spark/tree/master/examples/src/main/python (Python)
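To illustrate the transformation/action split, here is a hedged PySpark word-count sketch (the input path is hypothetical); the transformations only build a lazy plan, and the action triggers the distributed computation:

    # Transformations vs. actions in Spark, per the comment above.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount-sketch")
    lines = sc.textFile("hdfs:///data/sample.txt")  # lazy; nothing runs yet

    # flatMap/map/reduceByKey are transformations: they only build a plan.
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))

    # take() is an action: it triggers the actual execution.
    for word, n in counts.take(10):
        print(word, n)

    sc.stop()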
eranation over 8 years ago
The Hadoop ecosystem is pretty vast. I used the books "Hadoop: The Definitive Guide" and "Hadoop Security". But I see more and more organizations moving to solutions that augment Hadoop (or even replace it), such as Apache Spark. I've never yet seen a customer who would prefer a classic MapReduce job over a Spark app after understanding the performance and developer-productivity benefits. But I might just be lucky. As others said, AWS EMR is a great place to get started, and you can get Hadoop, Hive, Spark, Presto, and other great projects easily installed and ready to go.
acangiano over 8 years ago
Big Data University has several relevant courses and learning paths. They are all free, give you certificates upon completion, and open badges backed by IBM: https://bigdatauniversity.com/courses/

In general, after you get some Hadoop fundamentals, I would recommend focusing on Apache Spark instead.

Disclaimer: I'm part of the Big Data University team.
lmm over 8 years ago
Start with a real problem that you need to solve. That's the only way to learn something in a way that's actually effective.
mastratton3 over 8 years ago
That answer could go a number of directions depending on your level of experience. I would say find a problem you need a distributed system for and then, as said previously, use AWS EMR. If you're more interested in the infrastructure side of things, then it's always a good experience to set up a cluster from scratch.
binalpatel over 8 years ago
Start with a higher-level interface and work backwards? Possibly something like Hive/Pig/MRJob; get familiar with them to wrap your head around MapReduce.

Past the scope of your question, but I'd also recommend learning Spark as well; it's probably more relevant and marketable at this point than pure Hadoop.
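As a taste of the MRJob route, here is a minimal word-count job (file and input names are illustrative); it runs locally by default, or on a cluster with -r hadoop:

    # wordcount.py -- a minimal mrjob MapReduce job.
    # Run locally: python wordcount.py input.txt
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Map step: emit (word, 1) for each word in the line.
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            # Reduce step: sum the counts for each word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()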
thorin over 8 years ago
I was thinking about this today and stumbled across the Hortonworks Sandbox, which is a whole bunch of big data stuff, including Hadoop, set up to run on a VM with a load of tutorials built in. Seems like it would be worth checking out!

http://hortonworks.com/products/sandbox/
master_yoda_1 over 8 years ago
Hadoop is mostly used for processing large datasets, and AWS EMR is the best place to play with a large cluster. So start here: https://aws.amazon.com/articles/Elastic-MapReduce/2273
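For a sense of how little setup that takes, here is a hedged boto3 sketch of launching a small EMR cluster (region, release label, instance types, and IAM role names are assumptions, and running it creates billable AWS resources):

    # Launch a 3-node EMR cluster with Hadoop, Hive, and Spark preinstalled.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    resp = emr.run_job_flow(
        Name="hadoop-playground",
        ReleaseLabel="emr-5.36.0",  # assumed release label
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep it up for exploring
        },
        JobFlowRole="EMR_EC2_DefaultRole",  # default roles, assumed to exist
        ServiceRole="EMR_DefaultRole",
    )
    print("Cluster ID:", resp["JobFlowId"])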
opendomain over 8 years ago
I started an open source project for NoSQL training in all Big Data and machine learning technologies [1].

We have not done Hadoop yet; is there anyone who would like to help? We are considering crowdfunding to pay for all trainers so that the videos and book would be free. Or should we charge per course?

[1] http://NoSQL.Org
codepie over 8 years ago
Hadoop's source code has been on my reading list for a long time now. I have tried it before, but couldn't get through all the bits and pieces. Is there any strategy you should follow while reading any source code? Are there any code walkthroughs of Hadoop's source code?
mping over 8 years ago
I would recommend first of all that you get familiar with the names and versioning. Hadoop is a mess, and it's basically a family of big data projects.

You have two or three main components in Hadoop:

- Data nodes that constitute HDFS. HDFS is Hadoop's distributed file system, which is basically a replicated FS that stores a bunch of bytes. You can have really dumb data (let's say a bunch of bytes), compressed data (which saves space, but depending on the codec you may need to uncompress the whole file just to read a segment), data arranged in columns, etc. HDFS is agnostic of this. This is where you hear names like gzip, Snappy, LZO, Parquet, ORC, etc.

- Compute nodes which run tasks, jobs, etc., depending on the framework. Normally you submit a job which is composed of tasks that run on compute nodes that get data from HDFS nodes. A compute node can also be an HDFS node. There are a lot of frameworks on top of Hadoop; what is important is that you know the stack (ex: https://zekeriyabesiroglu.files.wordpress.com/2015/04/ekran-resmi-2015-04-23-08-34-00.png). So you have HDFS, and on top of that you (now) have YARN, which handles resource negotiation within a cluster.

- Scheduler/job runner. This is kinda what YARN does (please someone correct me). Actually it's a little more complicated: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html

Since Hadoop jobs are normally a JAR, there are several ways of creating a JAR ready to be submitted to a Hadoop cluster:

- Coding it in Java (nobody does it anymore)
- Writing in a quirky language called Pig
- Writing in an SQL-like language called HiveQL (you first need to create "tables" that map to files on HDFS)
- Writing against a generic Java framework called Cascading
- Writing jobs in Scala in a framework on top of Cascading called Scalding
- Writing in Clojure that maps either to Pig or to Cascading (Netflix PigPen)
- ...

As you can imagine, since HDFS is just a filesystem, other frameworks have appeared that do distributed processing and can connect to HDFS in some way:

- Apache Spark
- Facebook's Presto
- ...

And since there are so many moving parts, there are a lot of components for putting and getting data on HDFS, nicer job schedulers, etc. This is part of the Hadoop ecosystem: http://3.bp.blogspot.com/-3A_goHpmt1E/VGdwuFh0XwI/AAAAAAAAE2w/CKt2D2xmRkw/s1600/EcoSys_yarn.PNG

Back to your question: I suggest you spin up your own cluster (this one was the best guide I found: https://blog.insightdatascience.com/spinning-up-a-free-hadoop-cluster-step-by-step-c406d56bae42#.z6g3j1trf) and run some examples; a small put/list/read sketch follows below. There are a lot of details about Hadoop, such as how to store the data, how to schedule and run jobs, etc., but most of the time you are just connecting new components and fine-tuning jobs to run as fast as possible.

Make sure you don't get scared by lots of project names!
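Here is the kind of put/list/read session you would run against such a cluster, sketched with the hdfs (WebHDFS) Python client; the namenode URL and paths are hypothetical:

    # Treat HDFS as just a filesystem: upload, list, and read back bytes.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="dev")

    # Put a local CSV onto HDFS -- the usual first step before mapping
    # a Hive table (or any other framework) over it.
    client.upload("/data/tweets/tweets.csv", "tweets.csv", overwrite=True)

    print(client.list("/data/tweets"))

    # Read a segment back; HDFS itself is agnostic of the format.
    with client.read("/data/tweets/tweets.csv") as reader:
        print(reader.read(200))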