Ask HN: Challenges in building/maintaining/using data lakes

2 points by ekzhu, about 6 years ago
What challenges have you encountered in building/maintaining/using data lakes?

We (the data curation lab at the University of Toronto) are doing research on data lake discovery problems. One problem we are looking at is how to efficiently discover joinable and unionable tables. For example: find all the rental listings from various sources to create a master list (union), or find tables such as rental listings and school districts that can be used to augment each other (join). The technical challenges in finding joinable and unionable tables in data lakes include the following: (1) the data schema is often inconsistent and poorly managed, so we can't simply rely on it; and (2) a data lake can contain on the order of hundreds of thousands of tables, which makes a content-based search algorithm expensive. We came up with some solutions based on data sketches, described in several published papers [1,2,3]. The Python library "datasketch" was a byproduct of this work.

Many challenges remain, though, and we would like to explore some of the more pertinent ones. In fact, we are conducting a survey to understand the current state of data lakes in industry and the challenges people experience. Would love to see what the HN community thinks about the current state of data lakes.

Survey: https://www.surveymonkey.com/r/R7MYXSJ

[1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf
[2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf
[3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf
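To make the "data sketches" idea a bit more concrete, here is a minimal, standard-library-only illustration of MinHash, one well-known kind of sketch for estimating set overlap between two columns without comparing them value by value. It is only a sketch under stated assumptions: it is not the method from the papers above and not the datasketch API, and the column contents, the function names (`minhash_signature`, `estimate_jaccard`, `exact_jaccard`), and the choice of 128 hash functions are all made up for illustration.

```python
import hashlib

NUM_PERM = 128  # number of hash functions in a signature (illustrative choice)


def _seeded_hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of `value`, varied by `seed` via the blake2b salt."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_perm=NUM_PERM):
    """Return a fixed-size MinHash signature for the distinct values of a column."""
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot sketch an empty column")
    # For each seeded hash function, keep only the smallest hash value seen.
    return [min(_seeded_hash(v, seed) for v in distinct) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison on small inputs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical join-key columns from two tables with unrelated schemas.
rental_postal_codes = ["M5S", "M5T", "M4Y", "M6G", "M5V", "M4K"]
school_postal_codes = ["M5S", "M5T", "M4Y", "M6G", "M1B", "M9W"]

sig_rental = minhash_signature(rental_postal_codes)
sig_school = minhash_signature(school_postal_codes)

print("estimated overlap:", estimate_jaccard(sig_rental, sig_school))
print("exact overlap:    ", exact_jaccard(rental_postal_codes, school_postal_codes))
```

The point of the fixed-size signature is that it can be computed once per column and compared cheaply afterwards, which matters when there are hundreds of thousands of tables. A real system would also need an index over the signatures, so that a query does not have to be compared against every column.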

No comments yet.
