BTables: A fast, compact format for Machine Learning

37 points by thomson over 9 years ago

6 comments

fjordster over 9 years ago
HDF5 isn't perfect, but it does this kind of job pretty well. The C and C++ HDF5 APIs are definitely not fun to use, but there are wonderful and intuitive APIs available in some languages; I'm thinking of Python's h5py here.

Let me add that the OP's experience that HDF5 files were less space efficient than comparable CSV files suggests that something was grossly amiss in his use of HDF5.
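One common way to end up with HDF5 files larger than the equivalent CSV is to write uncompressed, contiguously stored datasets. A minimal sketch (assuming h5py is available; the file and dataset names are illustrative) of enabling chunking and gzip compression:

```python
# Illustrative h5py sketch, not code from the article.
# Chunked storage is required before any compression filter can apply.
import h5py

with h5py.File("demo.h5", "w") as f:
    f.create_dataset(
        "features",
        shape=(10000, 64),
        dtype="f4",
        chunks=(1000, 64),       # enable chunked storage
        compression="gzip",      # per-chunk gzip filter
        compression_opts=4,      # compression level 0-9
    )
```

Omitting the `chunks`/`compression` arguments silently falls back to contiguous, uncompressed storage.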
icsa over 9 years ago
The BTables discussion takes me back to my first college computing class (in FORTRAN). We were asked how we might store a sparse matrix in less memory. The solution was exactly the same as BTables'. We thought we'd done something novel, until the professor pointed out that it had already been implemented in the '60s.

Great ideas never fade. They do get reinvented :).
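The trick the comment describes can be sketched in a few lines (illustrative Python, not BTables code): store only the nonzero entries as (row, col, value) triples, the classic COO representation.

```python
# Sparse-matrix storage as (row, col, value) triples (COO format).
def to_coo(dense):
    """Convert a dense 2-D list into a list of (row, col, value) triples."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

def from_coo(triples, nrows, ncols):
    """Rebuild the dense matrix from its nonzero triples."""
    dense = [[0] * ncols for _ in range(nrows)]
    for i, j, v in triples:
        dense[i][j] = v
    return dense

matrix = [[0, 0, 3],
          [0, 5, 0],
          [0, 0, 0]]
coo = to_coo(matrix)   # only 2 triples instead of 9 cells
```

For a matrix that is mostly zeros, the triples take far less space than the full grid, at the cost of slower random access.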
rspeer over 9 years ago
I haven't tried the BTables format, but I agree with their criticism of HDF5. It seems to be an incredibly over-designed format with under-designed APIs.

(Why would I need a directory tree inside a file that only one process can write to anyway? Why wouldn't I just use the filesystem I already have?)
xaa over 9 years ago
It's too bad that this is for sparse data only. ML datasets have differing degrees of sparsity, and when the sparsity gets low enough, it's more efficient to use dense matrices, even when there are still missing values.

Also, if you have dense data, you can use mmap, which isn't very space efficient but is very fast. I guess it could also be made space efficient if you use a filesystem with transparent compression.
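The mmap approach for dense data can be sketched with the standard library alone (file name and layout here are made up for illustration): write a flat row-major buffer of doubles, then map it and read single elements without loading the whole file.

```python
# Memory-mapped dense matrix sketch using only the stdlib.
import mmap, os, struct, tempfile

NROWS, NCOLS = 4, 3
ITEM = struct.calcsize("d")        # 8 bytes per float64

# Write a dense row-major matrix of doubles to disk.
path = os.path.join(tempfile.mkdtemp(), "dense.bin")
with open(path, "wb") as f:
    for i in range(NROWS):
        for j in range(NCOLS):
            f.write(struct.pack("d", float(i * NCOLS + j)))

# Map it and read one element; the OS pages in only what is touched.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    def at(i, j):
        off = (i * NCOLS + j) * ITEM   # row-major offset arithmetic
        return struct.unpack_from("d", mm, off)[0]
    value = at(2, 1)                   # element written as 2*3 + 1 = 7.0
    mm.close()
```

In practice `numpy.memmap` wraps exactly this pattern with array indexing on top.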
blt over 9 years ago
I'm wondering why they chose row-major storage. I think it's far more common to care about only a subset of columns than a subset of rows.
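The layout trade-off behind this comment can be shown in a small sketch (illustrative Python, not BTables code): in a row-major flat buffer a column's elements are strided apart, while in column-major layout they form one contiguous run.

```python
# Row-major vs column-major access to a single column.
NROWS, NCOLS = 3, 4

# Same 3x4 matrix of values 0..11, flattened two ways.
flat_row_major = [i * NCOLS + j for i in range(NROWS) for j in range(NCOLS)]
flat_col_major = [i * NCOLS + j for j in range(NCOLS) for i in range(NROWS)]

def column_row_major(flat, j):
    # NROWS scattered reads, stride NCOLS apart.
    return flat[j::NCOLS]

def column_col_major(flat, j):
    # One contiguous slice of NROWS elements.
    return flat[j * NROWS:(j + 1) * NROWS]
```

Both return the same column, but on disk the contiguous slice turns into one sequential read while the strided version touches every row's storage.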
zobzu over 9 years ago
Interesting how it jumps from CSV to rewriting stuff without just using SQL and being done with it. Since CSV did the job almost well enough, it seems like SQL would be just fine and dandy, while being easier to manage and implement (minutes, literally).

Note: after reading a little more, I suspect SQL would be faster, in fact.
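The "just use SQL" idea is cheap to try with the stdlib `sqlite3` module; this sketch (schema and names hypothetical, not from the article) stores only the nonzero cells, much like a sparse format would.

```python
# Sparse cells in SQLite: one row per nonzero matrix entry.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cells (row_idx INTEGER, col_idx INTEGER, value REAL)")
con.executemany("INSERT INTO cells VALUES (?, ?, ?)",
                [(0, 2, 3.0), (1, 1, 5.0)])   # nonzero entries only

# Pulling one column back is a single indexed-friendly query.
rows = con.execute(
    "SELECT row_idx, value FROM cells WHERE col_idx = ? ORDER BY row_idx",
    (1,)).fetchall()
```

An index on `col_idx` would make the column query fast at scale; whether this beats a purpose-built binary format is exactly the question the comment raises.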