What challenges have you encountered in building, maintaining, or using data lakes?<p>We (the Data Curation Lab at the University of Toronto) are doing research on data lake discovery problems. One of the problems we are looking at is how to efficiently discover joinable and unionable tables. For example: find all the rental listings from various sources to create a master list (union), or find tables such as rental listings and school districts that can be used to augment each other (join). There are two main technical challenges in finding joinable and unionable tables in data lakes: (1) schemas are often inconsistent and poorly managed, so we can’t simply rely on them; and (2) a data lake can hold on the order of hundreds of thousands of tables, which makes content-based search expensive. We came up with some solutions based on data sketches and published several papers on them [1,2,3]. The Python library “datasketch” was a byproduct of this work.<p>Many challenges remain, though, and we would like to explore some of the more pertinent ones. To that end, we are conducting a survey to understand the current state of data lakes in industry and the challenges people run into. We would love to hear what the HN community thinks about the current state of data lakes.<p>Survey: https://www.surveymonkey.com/r/R7MYXSJ<p>[1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf
[2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf
[3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf
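<p>For anyone unfamiliar with the data-sketch idea mentioned above, here is a rough sketch of how a content-based joinability search might look. It is only an illustration: it is not the method from the papers and it does not use the datasketch API. It builds plain MinHash signatures with the standard library and ranks columns by estimated Jaccard similarity, chosen here only because it is the simplest measure to sketch. The column names, the data, and the sketch size of 128 are all made up for the example.

```python
import hashlib

NUM_PERM = 128  # assumption: sketch size picked for illustration only


def _hash(value: str, seed: int) -> int:
    """64-bit hash of a value under a given seed (stands in for one permutation)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(values, num_perm: int = NUM_PERM) -> list:
    """Return a MinHash signature of the distinct values in a column."""
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot sketch an empty column")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_perm)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical columns from three tables; schema names are ignored on purpose,
# only the cell values are used.
columns = {
    "rentals.postal_code": ["M5S", "M4K", "M6G", "M5V", "M4Y"],
    "schools.zip": ["M5S", "M4K", "M6G", "M5V", "M1B"],
    "rentals.price": ["1800", "2200", "2500", "1950", "3100"],
}

# Sketch every column once; a search then compares small signatures
# instead of scanning the full column contents.
sketches = {name: minhash(vals) for name, vals in columns.items()}

query = "rentals.postal_code"
ranked = sorted(
    (
        (estimate_jaccard(sketches[query], sig), name)
        for name, sig in sketches.items()
        if name != query
    ),
    reverse=True,
)

for score, name in ranked:
    print(f"{name}: estimated Jaccard {score:.2f}")
```

<p>This still compares the query against every sketch, so on its own it does not solve the scale problem; at hundreds of thousands of tables you would want an index over the sketches rather than a linear scan.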