科技回声

BTables: A fast, compact format for Machine Learning

37 points by thomson, over 9 years ago

6 comments

fjordster, over 9 years ago
HDF5 isn't perfect but it does this kind of job pretty well. The C, C++, and HDF5 APIs are definitely not fun to use, but there are wonderful and intuitive APIs available in some languages; I'm thinking of Python's h5py here.

Let me add that the OP's experience that HDF5 files were less space efficient than comparable CSV files suggests that something was grossly amiss in his use of HDF5.
icsa, over 9 years ago
The BTables discussion takes me back to memories of my first college computing class (in FORTRAN). We were asked how we might store a sparse matrix in less memory. The solution was exactly the same as BTables. We thought we'd done something novel when the professor pointed out that it had already been implemented in the 60s.

Great ideas never fade. They do get reinvented :).
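To make the sparse-matrix idea concrete, here is a minimal Python sketch that keeps only the non-zero entries of each row. The per-row (column, value) layout and the names `to_sparse_rows` and `to_dense` are assumptions chosen for illustration; the thread does not spell out the actual BTables layout.

```python
def to_sparse_rows(dense):
    """Keep only non-zero entries, row by row, as (column, value) pairs."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]


def to_dense(sparse_rows, n_cols):
    """Rebuild the dense matrix from the per-row (column, value) pairs."""
    dense = []
    for pairs in sparse_rows:
        row = [0] * n_cols
        for j, v in pairs:
            row[j] = v
        dense.append(row)
    return dense


dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 7],
]
sparse = to_sparse_rows(dense)
stored = sum(len(pairs) for pairs in sparse)  # number of values actually kept
```

In this toy case 3 of the 12 cells are stored. Each stored cell also carries a column index, so the saving only pays off when most cells are zero.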
rspeer, over 9 years ago
I haven't tried the BTables format, but I agree with their criticism of HDF5. It seems to be an incredibly over-designed format with under-designed APIs.

(Why would I need a directory tree inside a file that only one process can write to anyway? Why wouldn't I just use the filesystem I already have?)
xaa, over 9 years ago
It's too bad that this is for sparse data only. ML datasets have differing degrees of sparsity, and when the sparsity gets low enough, it's more efficient to use dense matrices, even when there are still missing values.

Also, if you have dense data, you can use mmap, which isn't very space efficient but is very fast. I guess it could also be made space efficient if you use a filesystem with transparent compression.
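A minimal sketch of the mmap approach for dense data, using only the Python standard library. The file layout (row-major, little-endian float64), the temporary file, and the helper `read_cell` are assumptions for illustration, not something the comment specifies.

```python
import mmap
import os
import struct
import tempfile

N_ROWS, N_COLS = 3, 4
ITEM = struct.Struct("<d")  # one little-endian float64 per cell

# Write a dense row-major matrix where cell (i, j) holds i * 10 + j.
path = os.path.join(tempfile.mkdtemp(), "dense.bin")
with open(path, "wb") as f:
    for i in range(N_ROWS):
        for j in range(N_COLS):
            f.write(ITEM.pack(i * 10 + j))


def read_cell(mm, i, j):
    """Read cell (i, j) directly at its byte offset in the mapping."""
    offset = (i * N_COLS + j) * ITEM.size
    return ITEM.unpack_from(mm, offset)[0]


with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        value = read_cell(mm, 2, 1)
        file_size = len(mm)
```

Every cell costs a fixed 8 bytes whether or not it holds a meaningful value, which is the space cost the comment mentions; in exchange, any cell is reachable with simple offset arithmetic.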
blt, over 9 years ago
Wondering why they chose row-major storage. I think it's far more common to only care about a subset of columns than a subset of rows.
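The trade-off can be sketched with a small flat buffer in Python. The matrix contents and the helper names are hypothetical; the point is only that a column is scattered in row-major order and contiguous in column-major order.

```python
N_ROWS, N_COLS = 3, 4
matrix = [[i * 10 + j for j in range(N_COLS)] for i in range(N_ROWS)]

# Flatten the same matrix in both orders.
row_major = [matrix[i][j] for i in range(N_ROWS) for j in range(N_COLS)]
col_major = [matrix[i][j] for j in range(N_COLS) for i in range(N_ROWS)]


def column_from_row_major(flat, j):
    """Column j is scattered: one element every N_COLS positions."""
    return flat[j::N_COLS]


def column_from_col_major(flat, j):
    """Column j is one contiguous run of N_ROWS elements."""
    return flat[j * N_ROWS:(j + 1) * N_ROWS]


col_a = column_from_row_major(row_major, 2)
col_b = column_from_col_major(col_major, 2)
```

Both calls return the same column; on disk, the strided access pattern touches far more of the file than the contiguous one when only a few columns are needed.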
zobzu, over 9 years ago
Interesting how it jumps from CSV to rewriting things without just using SQL and being done with it. Since CSV did the job almost well enough, it seems like SQL would be just fine and dandy, while being easier to manage and implement (minutes, literally).

Note: after reading a little more, I suspect SQL would be faster, in fact.
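For a sense of what the SQL route might look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `cells`, its columns, and the sample values are assumptions for illustration; the sketch makes no claim about how its speed compares with BTables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells (row_id INTEGER, col_id INTEGER, value REAL, "
    "PRIMARY KEY (row_id, col_id))"
)
# Store only the non-zero cells as (row, column, value) triples.
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(0, 2, 3.0), (2, 0, 5.0), (2, 3, 7.0)],
)

# Fetch the non-zero entries of row 2, ordered by column.
row_2 = conn.execute(
    "SELECT col_id, value FROM cells WHERE row_id = ? ORDER BY col_id", (2,)
).fetchall()
conn.close()
```

The composite primary key doubles as an index on `(row_id, col_id)`, so looking up one row does not require scanning the whole table.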