
Using Awk and R to parse 25tb (2019)

88 points · by xrayarx · over 1 year ago

6 comments

xnx · over 1 year ago
Reminds me of "Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)": https://news.ycombinator.com/item?id=30595026
corytheboyd · over 1 year ago
Awk is such a nice little tool! It doesn't even have to be an archaic one-liner akin to a Stack Overflow answer. You can write well-structured, easy-to-follow awk programs that use variables, sane conditional logic, matching functions, etc. You can do all of this by referencing the man page you already have, and nothing else. It's a bit like something between bash and perl: enough functionality to accomplish non-trivial file-processing tasks, but not a fully featured programming language. Which is perfect when it's perfect.
Comment #37807619 not loaded
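A minimal sketch of the kind of structured awk program corytheboyd describes above: named functions, pattern-action rules, and plain variables. The input layout is an assumption (a hypothetical whitespace-delimited log where column 3 is a status code and column 4 a byte count), not anything from the article.

```awk
#!/usr/bin/awk -f
# Summarize a hypothetical access log: requests per status code and total bytes.
# Assumed columns: $3 = status code, $4 = bytes transferred.

function human(bytes) {
    # Format a byte count for the summary line.
    if (bytes >= 1048576)
        return sprintf("%.1f MB", bytes / 1048576)
    return sprintf("%d B", bytes)
}

# Skip comment lines and blank lines.
/^#/ || NF == 0 { next }

# Pattern-action rule: remember which error statuses (4xx/5xx) appeared.
$3 ~ /^[45][0-9][0-9]$/ { if (!($3 in errors)) { errors[$3] = 1; nerr++ } }

# Runs for every remaining line: accumulate per-status counts and bytes.
{
    count[$3]++
    total += $4
}

END {
    for (status in count)
        printf "%-4s %d\n", status, count[status]
    printf "total transferred: %s\n", human(total)
    printf "distinct error statuses: %d\n", nerr
}
```

Run it with "awk -f summary.awk access.log". Each pattern-action pair reads as a small rule, which is what keeps longer awk programs legible.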
laurent_du · over 1 year ago
Querying by rsid is clearly a bad idea. You want to partition by chromosome (and sample-id in this case) and sort by position. When looking for a given SNP, the parquet reader will go through the metadata to only read the data page that contains the given position. Unless your pages are huge, read time will be super small (and cost-efficient, since you don't fetch too much data). Since the data is static, I would want to try storing all the sample data and metadata in arrays. (For non-static data you can't do that, because you won't be able to edit the arrays later - you can only add new rows to the parquets.) I am not really sure I understand what the author is doing; it sounds like he wanted to sort by position but failed to do so and decided to bin instead? I agree that Awk is very useful for this kind of problem.
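A rough sketch of that layout in DuckDB SQL, one concrete way to do what the comment describes; the table, file, and column names (variants.tsv, chromosome, sample_id, position, genotype, NA12878, the positions) are illustrative assumptions, not from the article:

```sql
-- Write Parquet partitioned by chromosome and sample, sorted by position
-- within each partition, so position min/max stats are tight per row group.
COPY (
    SELECT chromosome, sample_id, position, rsid, genotype
    FROM read_csv_auto('variants.tsv')
    ORDER BY chromosome, sample_id, position
) TO 'variants_parquet'
  (FORMAT PARQUET, PARTITION_BY (chromosome, sample_id));

-- A lookup by position now prunes whole partitions by chromosome/sample and
-- skips row groups whose position statistics exclude the target.
SELECT *
FROM read_parquet('variants_parquet/**/*.parquet', hive_partitioning = true)
WHERE chromosome = 'chr7'
  AND sample_id  = 'NA12878'
  AND position   = 55181370;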
_a_a_a_ · over 1 year ago
I'm a database guy, so everything looks like a database problem to me, but I'm not sure how this would fit in (as I'm completely unfamiliar with the data used here). Can anyone more knowledgeable than me suggest whether a database on a conventional server with some decent RAM and a bunch of SSDs would have worked, and perhaps been cheaper?

(Edit: OK, SSDs in 2019 might not have been affordable, but spinny disks were cheap and still pretty fast.)
dang · over 1 year ago
Related:

Using AWK and R to parse 25TB - https://news.ycombinator.com/item?id=20293579 - June 2019 (104 comments)

Recent and also related:

Exploratory data analysis for humanities data - https://news.ycombinator.com/item?id=37792916 - Oct 2023 (38 comments)
Comment #37822048 not loaded
isoprophlex · over 1 year ago
Now that DuckDB has S3 support, I guess a Linux box, DuckDB, and some light SQL'ing is all you need?
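For context, roughly what that would look like in DuckDB SQL; the bucket path and column names are made up, and S3 credentials are assumed to be configured separately:

```sql
-- Enable DuckDB's S3 support (the bundled httpfs extension).
INSTALL httpfs;
LOAD httpfs;

-- Scan Parquet files straight from S3 and answer a range query
-- from a single Linux box, no cluster involved.
SELECT sample_id, genotype
FROM read_parquet('s3://my-bucket/variants/**/*.parquet')   -- hypothetical bucket
WHERE chromosome = 'chr7'
  AND position BETWEEN 55180000 AND 55200000;
```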