Most people learn by doing rather than reading (myself included), so just pick a project and start building.

This is the journey I took:

- Set up a Hadoop cluster from scratch (start with 4 nodes on VirtualBox)
- Write software to crawl and store data on every single torrent (I don't know why I picked torrents, it was just interesting at the time). Whatever you choose, pick a single topic and then scale it as far as you can - a rough sketch of a crawl-and-store loop is at the bottom.

Can I store 100,000 torrent files? Can I crawl 200 websites every 5 minutes? Can I index every single file inside every torrent? Whoops, I have 500,000,000 rows now - can I distribute that across a cluster? Can I upgrade the cluster without downtime? Can I swap Hadoop and HBase out for Cassandra? Can I do that with no downtime? Why aren't all these CPUs being utilised? How can I use Redis as a distributed cache (sketch below)? Now the whole system is running, can I scale it 2x, 5x, 10x? What happens if I randomly kill a node?

Just pick a single project - Astronomy Data, Weather Data, Planes in the air, open IoT sensors, IRC chat, Free Satellite Data, Twitter streams - pick a data source that interests you, and then your exercise is to scale it *as far as you can*. This is an exercise in engineering, not data science, not pure research; the goal is scale.

As you build this you'll do research and discover which technologies are better at scaling for reads, for writes, for different consistency guarantees, and for different querying abilities.

Sure, you could read about all of this, but unless you apply it, much of it won't stick.
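To make the crawl-and-store step concrete, here's a minimal sketch in Python of the kind of loop I mean: poll some metadata source, then write one HBase row per torrent keyed by info-hash. The tracker endpoint, hosts, table name and JSON fields are all placeholders I made up; it assumes an HBase Thrift server and a `torrents` table with a `meta` column family, and it's a starting point to scale out, not a finished crawler.

```python
"""Minimal crawl-and-store loop: fetch torrent metadata from a (hypothetical)
tracker API and write one HBase row per torrent, keyed by info-hash."""
import json
import time

import happybase   # pip install happybase (talks to HBase via Thrift)
import requests    # pip install requests

TRACKER_API = "https://tracker.example.org/api/recent"   # placeholder endpoint
HBASE_HOST = "hbase-master.local"                        # placeholder Thrift server

def crawl_once(table):
    resp = requests.get(TRACKER_API, timeout=10)
    resp.raise_for_status()
    for torrent in resp.json():                          # assume a JSON list of torrents
        row_key = torrent["info_hash"].encode()          # info-hash makes a natural row key
        table.put(row_key, {
            b"meta:name": torrent["name"].encode(),
            b"meta:size": str(torrent["size_bytes"]).encode(),
            b"meta:files": json.dumps(torrent.get("files", [])).encode(),
        })

def main():
    connection = happybase.Connection(HBASE_HOST)
    table = connection.table("torrents")
    while True:
        crawl_once(table)
        time.sleep(300)   # "can I crawl every 5 minutes?" - start here, then shrink it

if __name__ == "__main__":
    main()
```

Once something like that works, the scaling questions above fall out naturally: shrink the sleep, add more sources, run more crawler processes, and watch what breaks first.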
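And for the "Redis as a distributed cache" question, the usual answer is the cache-aside pattern: check Redis first, fall back to the backing store on a miss, then populate the cache with a TTL. Again, the hosts and table name are placeholders; this sketch assumes the HBase table from the previous snippet and the redis-py client.

```python
"""Cache-aside read path: Redis in front of a slower backing store (HBase here)."""
import json

import happybase        # backing store from the previous sketch
import redis            # pip install redis

r = redis.Redis(host="redis.local", port=6379)                      # placeholder host
torrents = happybase.Connection("hbase-master.local").table("torrents")

def get_torrent(info_hash: bytes):
    cached = r.get(info_hash)                  # 1. try the cache
    if cached is not None:
        return json.loads(cached)
    row = torrents.row(info_hash)              # 2. miss: read from the backing store
    if not row:
        return None
    doc = {k.decode(): v.decode() for k, v in row.items()}
    r.setex(info_hash, 3600, json.dumps(doc))  # 3. populate the cache, expire after an hour
    return doc
```

With that in place you get the next round of questions for free: what's the hit rate, what happens when Redis restarts, and does the cache itself need to become a cluster?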