Hadoop (and hence HDFS) is a stack of services designed to work together to serve a file system and manage jobs. The Hadoop stack has pluggable authentication/authorization by design. And yes, the default is "no security".

Given its distributed nature, HDFS runs across multiple machines, and on Linux the usual fit for securing a distributed service is Kerberos. So if you want a "secure" HDFS you normally "kerberize" the services such that any Hadoop operation requires a valid, authorized TGT.

To most people, kerberizing a Hadoop cluster is a major barrier to getting Hadoop running. I don't see this changing, though certain vendor Hadoop distros break down some of the barriers.

Sometimes it is OK to run a cluster insecure. Please don't do it if you're handling my financial or medical records, though. As Mr. T once said, "don't write checks that yo ass can't cash".
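For the curious, here's roughly what the client side of "any Hadoop operation requires a valid TGT" looks like once the cluster is kerberized. This is just a sketch: the principal, keytab path, and NameNode address are made up, and the cluster-side services still need their own keytabs and config.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberizedHdfsClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; substitute your own.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            // Switch the client from the default "simple" auth to Kerberos.
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("hadoop.security.authorization", "true");

            UserGroupInformation.setConfiguration(conf);
            // Obtain credentials from a keytab; principal and path are placeholders.
            UserGroupInformation.loginUserFromKeytab(
                    "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");

            // With valid Kerberos credentials, ordinary HDFS operations work as usual.
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }

For interactive use the same idea applies: kinit first, then run your hadoop/hdfs commands with the resulting ticket.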
Even if node-to-node communication in a cluster (Hadoop or otherwise) is not itself secured, is it not reasonable to secure external access to the cluster (e.g. with a firewall)?

From an outsider's perspective (I've never used or run Hadoop) I can't see much reason for exposing the cluster to the outside world: either a web app acts as an intermediary, or access can be provided via VPN/ssh-tunnel/etc.

... just curious why a fully/publicly exposed cluster would be a "requirement"? Or does it come down to firewalling an AWS environment being as painful as (if not more painful than) "kerberizing" a [Hadoop] cluster? (I kind of assumed AWS has firewalling functionality that is fairly plug'n'play ... a quick search seems to back that up, though.)
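For what it's worth, the plug'n'play firewalling I had in mind is security groups. A rough sketch with the AWS Java SDK of allowing the NameNode RPC port only from a private range (the group id, port, and CIDR below are placeholders, not anything HDFS-specific I can vouch for):

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.AuthorizeSecurityGroupIngressRequest;
    import com.amazonaws.services.ec2.model.IpPermission;
    import com.amazonaws.services.ec2.model.IpRange;

    public class LockDownHdfsPorts {
        public static void main(String[] args) {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // Allow the NameNode RPC port (8020 here) only from a private CIDR,
            // e.g. the VPN or application subnet.
            IpPermission namenodeRpc = new IpPermission()
                    .withIpProtocol("tcp")
                    .withFromPort(8020)
                    .withToPort(8020)
                    .withIpv4Ranges(new IpRange().withCidrIp("10.0.0.0/16"));

            ec2.authorizeSecurityGroupIngress(new AuthorizeSecurityGroupIngressRequest()
                    .withGroupId("sg-0123456789abcdef0")
                    .withIpPermissions(namenodeRpc));
        }
    }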
I knew it was a bad idea to post 'getting started' tutorials that skip all the security steps and replace them with a 'probably don't wanna do it this way in production' (and usually no documentation on how one should do it)...

Not levelling this comment at HDFS solely, but it's about time people stopped with the 'hello world' style examples.