If you are interested in Data Warehousing, you should read Ralph Kimball's "The Data Warehouse Toolkit": <a href="http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247" rel="nofollow">http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...</a><p>When I started learning about BI (Business Intelligence), a few members of the Pentaho community advised me to read this book. I'm glad I did. Kimball is one of the "fathers" of data warehousing, and his book had a lot of great insights for dimensional modeling. It helped me avoid many design mistakes while building my DWH, and gave me insight I might have taken years to discover.<p>It's a "theoretical" book, in the sense that it does not focus on any specific technology; it's also a "practical book", because he uses real-world scenarios (inventory management, e-commerce, CRM...) to demonstrate the various dimensional modeling techniques. I also liked the part about BI project management and encouraging BI in a company (= how to engage users and how to "sell" a BI project to management).<p>He also has a newsletter with many DWH design tips (archives here: <a href="http://www.kimballgroup.com/html/07dt.html" rel="nofollow">http://www.kimballgroup.com/html/07dt.html</a> ).
Oldies but goodies<p><a href="http://philip.greenspun.com/wtr/data-warehousing.html" rel="nofollow">http://philip.greenspun.com/wtr/data-warehousing.html</a><p>(data warehousing for cavemen)
Shopify is pondering open-sourcing our internal tool called Tiller. It runs all the reporting for our considerable data warehouse efforts, yet it's lightweight and super fast to get running.<p>Watch this space.
>If you're building an archive, your only requirements are to minimize storage cost and to make sure the archive can keep up with the generation of data.<p>And in some of the cases he mentions, be really, <i>really</i> certain you don't lose data, since some of the laws impose criminal penalties for data loss, and not necessarily even on the most responsible parties (legislatures have been getting increasingly psychotic this way).
The last stage of enterprise integration with the DW is through Data Marts, which are organized into Dimensions and Facts and give business users dynamic interfaces for mining their data. My current project uses Informatica CDC (Change Data Capture) to read multiple source databases through their logs and aggregate the changes in real time. It's really incredible, and it supports reporting requirements of any level of intricacy.
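For anyone who hasn't seen the Dimensions-and-Facts layout, here is a minimal sketch of a star schema using Python's built-in sqlite3. The table names (dim_product, dim_date, fact_sales), the columns, and the sample rows are all made up for illustration; this is not how Informatica or any particular data mart tool structures things.

```python
import sqlite3


def build_mart():
    """Create a tiny in-memory star schema: two dimensions and one fact table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimensions hold descriptive attributes (hypothetical columns).
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT, month TEXT);
        -- The fact table holds measurements plus a key into each dimension.
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            date_key INTEGER REFERENCES dim_date(date_key),
            quantity INTEGER,
            amount REAL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20240101, "2024-01-01", "2024-01"), (20240201, "2024-02-01", "2024-02")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 20240101, 2, 20.0), (2, 20240101, 1, 15.0),
         (3, 20240201, 4, 40.0), (1, 20240201, 1, 10.0)],
    )
    return conn


def revenue_by_category_and_month(conn):
    """Join the fact table to its dimensions and aggregate the amount measure."""
    query = """
        SELECT p.category, d.month, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY p.category, d.month
        ORDER BY p.category, d.month
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in revenue_by_category_and_month(build_mart()):
        print(row)
```

The point of the layout is that every report becomes the same shape of query: join the fact table to whichever dimensions you want to slice by, then aggregate the measures.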
This isn't a good article about data warehousing 101. I've been working in data warehousing since 2004. The core thing in DW is the DWH data model, because it is the abstraction layer that converts raw transactional data into a meaningful, consistent, correct and persistent representation of an organization's activity. Tools (including those mentioned in the article) are just a means to achieve that goal.
Hive is a really slick DW tool built on top of Hadoop. It has a SQL-like language and supports typical DW techniques like table partitioning, key clustering, etc.
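To give a rough idea of what partitioning and clustering buy you, here is a toy sketch in plain Python. It is not Hive code and does not use Hive's actual hash function; crc32, the bucket count, and the column names (dt, user_id) are assumptions chosen only to show the idea that rows are grouped by partition value and then hashed into a fixed number of buckets, so a query can skip everything outside the partition and bucket it needs.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count

# Hypothetical sample rows.
ROWS = [
    {"dt": "2024-01-01", "user_id": "u1", "amount": 5},
    {"dt": "2024-01-01", "user_id": "u2", "amount": 7},
    {"dt": "2024-01-02", "user_id": "u1", "amount": 3},
    {"dt": "2024-01-02", "user_id": "u3", "amount": 9},
]


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket with a stable hash (crc32 is a stand-in, not Hive's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def layout(rows, partition_col, bucket_col):
    """Group rows as partition value -> bucket number -> list of rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row[partition_col]][bucket_for(row[bucket_col])].append(row)
    return table


def scan(table, partition_value, bucket_col=None, bucket_key=None):
    """Read one partition only; if a key is given, read only that key's bucket."""
    partition = table.get(partition_value, {})
    if bucket_key is None:
        return [row for bucket in partition.values() for row in bucket]
    candidates = partition.get(bucket_for(bucket_key), [])
    # A bucket can hold several keys, so filter after pruning.
    return [row for row in candidates if row[bucket_col] == bucket_key]


if __name__ == "__main__":
    table = layout(ROWS, "dt", "user_id")
    print(scan(table, "2024-01-01"))
    print(scan(table, "2024-01-02", "user_id", "u1"))
```

In Hive itself you would declare this in the table DDL rather than write it by hand; the sketch only shows why a filter on the partition column or the clustering key touches less data.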